Showing 1–28 of 28 results for author: Nagel, M

Searching in archive cs.
  1. arXiv:2406.13175

    cs.LG cs.AI

    Sparse High Rank Adapters

    Authors: Kartikeya Bhardwaj, Nilesh Prasad Pandey, Sweta Priyadarshi, Viswanath Ganapathy, Rafael Esteves, Shreya Kadambi, Shubhankar Borse, Paul Whatmough, Risheek Garrepalli, Mart Van Baalen, Harris Teague, Markus Nagel

    Abstract: Low Rank Adaptation (LoRA) has gained massive attention in the recent generative AI research. One of the main advantages of LoRA is its ability to be fused with pretrained models adding no overhead during inference. However, from a mobile deployment standpoint, we can either avoid inference overhead in the fused mode but lose the ability to switch adapters rapidly, or suffer significant (up to 30%…

    Submitted 18 June, 2024; originally announced June 2024.

  2. arXiv:2406.06385

    cs.LG cs.AI cs.CL

    Low-Rank Quantization-Aware Training for LLMs

    Authors: Yelysei Bondarenko, Riccardo Del Chiaro, Markus Nagel

    Abstract: Large language models (LLMs) are omnipresent, however their practical deployment is challenging due to their ever increasing computational and memory demands. Quantization is one of the most effective ways to make them more compute and memory efficient. Quantization-aware training (QAT) methods, generally produce the best quantized performance, however it comes at the cost of potentially long trai…

    Submitted 20 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

  3. arXiv:2402.15319

    cs.LG cs.CL

    GPTVQ: The Blessing of Dimensionality for LLM Quantization

    Authors: Mart van Baalen, Andrey Kuzmin, Markus Nagel, Peter Couperus, Cedric Bastoul, Eric Mahurin, Tijmen Blankevoort, Paul Whatmough

    Abstract: In this work we show that the size versus accuracy trade-off of neural network quantization can be significantly improved by increasing the quantization dimensionality. We propose the GPTVQ method, a new fast method for post-training vector quantization (VQ) that scales well to Large Language Models (LLMs). Our method interleaves quantization of one or more columns with updates to the remaining un…

    Submitted 23 February, 2024; originally announced February 2024.

  4. arXiv:2312.17244

    cs.LG cs.CL

    The LLM Surgeon

    Authors: Tycho F. A. van der Ouderaa, Markus Nagel, Mart van Baalen, Yuki M. Asano, Tijmen Blankevoort

    Abstract: State-of-the-art language models are becoming increasingly large in an effort to achieve the highest performance on large corpora of available textual data. However, the sheer size of the Transformer architectures makes it difficult to deploy models within computational, environmental or device-specific constraints. We explore data-driven compression of existing pretrained models as an alternative…

    Submitted 20 March, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

  5. arXiv:2310.01258

    eess.IV cs.CV cs.LG

    MobileNVC: Real-time 1080p Neural Video Compression on a Mobile Device

    Authors: Ties van Rozendaal, Tushar Singhal, Hoang Le, Guillaume Sautiere, Amir Said, Krishna Buska, Anjuman Raha, Dimitris Kalatzis, Hitarth Mehta, Frank Mayer, Liang Zhang, Markus Nagel, Auke Wiggers

    Abstract: Neural video codecs have recently become competitive with standard codecs such as HEVC in the low-delay setting. However, most neural codecs are large floating-point networks that use pixel-dense warping operations for temporal modeling, making them too computationally expensive for deployment on mobile devices. Recent work has demonstrated that running a neural decoder in real time on mobile is f…

    Submitted 15 November, 2023; v1 submitted 2 October, 2023; originally announced October 2023.

    Comments: Matches version published at WACV 2024

  6. arXiv:2309.01729

    cs.LG cs.AI cs.CV

    Softmax Bias Correction for Quantized Generative Models

    Authors: Nilesh Prasad Pandey, Marios Fournarakis, Chirag Patel, Markus Nagel

    Abstract: Post-training quantization (PTQ) is the go-to compression technique for large generative models, such as stable diffusion or large language models. PTQ methods commonly keep the softmax activation in higher precision as it has been shown to be very sensitive to quantization noise. However, this can lead to a significant runtime and power overhead during inference on resource-constraint edge device…

    Submitted 4 September, 2023; originally announced September 2023.

  7. arXiv:2308.09511

    cs.CV

    ResQ: Residual Quantization for Video Perception

    Authors: Davide Abati, Haitam Ben Yahia, Markus Nagel, Amirhossein Habibian

    Abstract: This paper accelerates video perception, such as semantic segmentation and human pose estimation, by levering cross-frame redundancies. Unlike the existing approaches, which avoid redundant computations by warping the past features using optical-flow or by performing sparse convolutions on frame differences, we approach the problem from a new perspective: low-bit quantization. We observe that resi…

    Submitted 18 August, 2023; originally announced August 2023.

    Comments: ICCV 2023

  8. arXiv:2307.04535

    cs.LG cs.AI cs.CV

    QBitOpt: Fast and Accurate Bitwidth Reallocation during Training

    Authors: Jorn Peters, Marios Fournarakis, Markus Nagel, Mart van Baalen, Tijmen Blankevoort

    Abstract: Quantizing neural networks is one of the most effective methods for achieving efficient inference on mobile and embedded devices. In particular, mixed precision quantized (MPQ) networks, whose layers can be quantized to different bitwidths, achieve better task performance for the same resource constraint compared to networks with homogeneous bitwidths. However, finding the optimal bitwidth allocat…

    Submitted 10 July, 2023; originally announced July 2023.

  9. arXiv:2307.02973

    cs.LG

    Pruning vs Quantization: Which is Better?

    Authors: Andrey Kuzmin, Markus Nagel, Mart van Baalen, Arash Behboodi, Tijmen Blankevoort

    Abstract: Neural network pruning and quantization techniques are almost as old as neural networks themselves. However, to date only ad-hoc comparisons between the two have been published. In this paper, we set out to answer the question on which is better: neural network quantization or pruning? By answering this question, we hope to inform design decisions made on neural network hardware going forward. We…

    Submitted 16 February, 2024; v1 submitted 6 July, 2023; originally announced July 2023.

  10. arXiv:2306.12929

    cs.LG cs.AI cs.CL cs.CV

    Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing

    Authors: Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort

    Abstract: Transformer models have been widely adopted in various domains over the last years, and especially large language models have advanced the field of AI significantly. Due to their size, the capability of these networks has increased tremendously, but this has come at the cost of a significant increase in necessary compute. Quantization is one of the most effective ways to reduce the computational t…

    Submitted 9 November, 2023; v1 submitted 22 June, 2023; originally announced June 2023.

  11. arXiv:2303.17951

    cs.LG

    FP8 versus INT8 for efficient deep learning inference

    Authors: Mart van Baalen, Andrey Kuzmin, Suparna S Nair, Yuwei Ren, Eric Mahurin, Chirag Patel, Sundar Subramanian, Sanghyuk Lee, Markus Nagel, Joseph Soriaga, Tijmen Blankevoort

    Abstract: Recently, the idea of using FP8 as a number format for neural network training has been floating around the deep learning world. Given that most training is currently conducted with entire networks in FP32, or sometimes FP16 with mixed-precision, the step to having some parts of a network run in FP8 with 8-bit weights is an appealing potential speed-up for the generally costly and time-intensive t…

    Submitted 15 June, 2023; v1 submitted 31 March, 2023; originally announced March 2023.

  12. arXiv:2302.05397

    cs.LG

    A Practical Mixed Precision Algorithm for Post-Training Quantization

    Authors: Nilesh Prasad Pandey, Markus Nagel, Mart van Baalen, Yin Huang, Chirag Patel, Tijmen Blankevoort

    Abstract: Neural network quantization is frequently used to optimize model size, latency and power consumption for on-device deployment of neural networks. In many cases, a target bit-width is set for an entire network, meaning every layer get quantized to the same number of bits. However, for many networks some layers are significantly more robust to quantization noise than others, leaving an important axi…

    Submitted 10 February, 2023; originally announced February 2023.

  13. arXiv:2211.16912

    cs.LG

    Quadapter: Adapter for GPT-2 Quantization

    Authors: Minseop Park, Jaeseong You, Markus Nagel, Simyung Chang

    Abstract: Transformer language models such as GPT-2 are difficult to quantize because of outliers in activations leading to a large quantization error. To adapt to the error, one must use quantization-aware training, which entails a fine-tuning process based on the dataset and the training pipeline identical to those for the original model. Pretrained language models, however, often do not grant access to t…

    Submitted 30 November, 2022; originally announced November 2022.

  14. arXiv:2208.09225

    cs.LG

    FP8 Quantization: The Power of the Exponent

    Authors: Andrey Kuzmin, Mart Van Baalen, Yuwei Ren, Markus Nagel, Jorn Peters, Tijmen Blankevoort

    Abstract: When quantizing neural networks for efficient inference, low-bit integers are the go-to format for efficiency. However, low-bit floating point numbers have an extra degree of freedom, assigning some bits to work on an exponential scale instead. This paper in-depth investigates this benefit of the floating point format for neural network inference. We detail the choices that can be made for the FP8…

    Submitted 23 February, 2024; v1 submitted 19 August, 2022; originally announced August 2022.

  15. arXiv:2207.11048

    cs.LG

    Quantized Sparse Weight Decomposition for Neural Network Compression

    Authors: Andrey Kuzmin, Mart van Baalen, Markus Nagel, Arash Behboodi

    Abstract: In this paper, we introduce a novel method of neural network weight compression. In our method, we store weight tensors as sparse, quantized matrix factors, whose product is computed on the fly during inference to generate the target model's weights. We use projected gradient descent methods to find quantized and sparse factorization of the weight tensors. We show that this approach can be seen as…

    Submitted 22 July, 2022; originally announced July 2022.

  16. arXiv:2206.10844

    cs.LG cs.DC

    Quantization Robust Federated Learning for Efficient Inference on Heterogeneous Devices

    Authors: Kartik Gupta, Marios Fournarakis, Matthias Reisser, Christos Louizos, Markus Nagel

    Abstract: Federated Learning (FL) is a machine learning paradigm to distributively learn machine learning models from decentralized data that remains on-device. Despite the success of standard Federated optimization methods, such as Federated Averaging (FedAvg) in FL, the energy demands and hardware induced constraints for on-device learning have not been considered sufficiently in the literature. Specifica…

    Submitted 22 June, 2022; originally announced June 2022.

  17. arXiv:2203.11086

    cs.LG

    Overcoming Oscillations in Quantization-Aware Training

    Authors: Markus Nagel, Marios Fournarakis, Yelysei Bondarenko, Tijmen Blankevoort

    Abstract: When training neural networks with simulated quantization, we observe that quantized weights can, rather unexpectedly, oscillate between two grid-points. The importance of this effect and its impact on quantization-aware training (QAT) are not well-understood or investigated in literature. In this paper, we delve deeper into the phenomenon of weight oscillations and show that it can lead to a sign…

    Submitted 28 June, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

    Comments: Published as oral paper at ICML 2022

  18. arXiv:2202.01290

    cs.LG cs.CV

    Cyclical Pruning for Sparse Neural Networks

    Authors: Suraj Srinivas, Andrey Kuzmin, Markus Nagel, Mart van Baalen, Andrii Skliar, Tijmen Blankevoort

    Abstract: Current methods for pruning neural network weights iteratively apply magnitude-based pruning on the model weights and re-train the resulting model to recover lost accuracy. In this work, we show that such strategies do not allow for the recovery of erroneously pruned weights. To enable weight recovery, we propose a simple strategy called cyclical pruning which requires the pruning schedul…

    Submitted 2 February, 2022; originally announced February 2022.

  19. arXiv:2201.08442

    cs.LG cs.AI cs.AR cs.PF cs.SE

    Neural Network Quantization with AI Model Efficiency Toolkit (AIMET)

    Authors: Sangeetha Siddegowda, Marios Fournarakis, Markus Nagel, Tijmen Blankevoort, Chirag Patel, Abhijit Khobare

    Abstract: While neural networks have advanced the frontiers in many machine learning applications, they often come at a high computational cost. Reducing the power and latency of neural network inference is vital to integrating modern networks into edge devices with strict power and compute requirements. Neural network quantization is one of the most effective ways of achieving these savings, but the additi…

    Submitted 20 January, 2022; originally announced January 2022.

    Comments: arXiv admin note: substantial text overlap with arXiv:2106.08295

  20. arXiv:2112.11312

    cs.LG cs.CV

    Implicit Neural Video Compression

    Authors: Yunfan Zhang, Ties van Rozendaal, Johann Brehmer, Markus Nagel, Taco Cohen

    Abstract: We propose a method to compress full-resolution video sequences with implicit neural representations. Each frame is represented as a neural network that maps coordinate positions to pixel values. We use a separate implicit network to modulate the coordinate inputs, which enables efficient motion compensation between frames. Together with a small residual network, this allows us to efficiently comp…

    Submitted 21 December, 2021; originally announced December 2021.

  21. arXiv:2109.12948

    cs.LG cs.AI cs.CL

    Understanding and Overcoming the Challenges of Efficient Transformer Quantization

    Authors: Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort

    Abstract: Transformer-based architectures have become the de-facto standard models for a wide range of Natural Language Processing tasks. However, their memory footprint and high latency are prohibitive for efficient deployment and inference on resource-limited devices. In this work, we explore quantization for transformers. We show that transformers have unique quantization challenges -- namely, high dynam…

    Submitted 27 September, 2021; originally announced September 2021.

  22. arXiv:2106.08295

    cs.LG cs.AI cs.CV

    A White Paper on Neural Network Quantization

    Authors: Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart van Baalen, Tijmen Blankevoort

    Abstract: While neural networks have advanced the frontiers in many applications, they often come at a high computational cost. Reducing the power and latency of neural network inference is key if we want to integrate modern networks into edge devices with strict power and compute requirements. Neural network quantization is one of the most effective ways of achieving these savings but the additional noise…

    Submitted 15 June, 2021; originally announced June 2021.

  23. arXiv:2105.04246

    cs.LG cs.AI cs.CV

    In-Hindsight Quantization Range Estimation for Quantized Training

    Authors: Marios Fournarakis, Markus Nagel

    Abstract: Quantization techniques applied to the inference of deep neural networks have enabled fast and efficient execution on resource-constraint devices. The success of quantization during inference has motivated the academic community to explore fully quantized training, i.e. quantizing back-propagation as well. However, effective gradient quantization is still an open problem. Gradients are unbounded a…

    Submitted 10 May, 2021; originally announced May 2021.

  24. arXiv:2005.07093

    cs.LG cs.CV stat.ML

    Bayesian Bits: Unifying Quantization and Pruning

    Authors: Mart van Baalen, Christos Louizos, Markus Nagel, Rana Ali Amjad, Ying Wang, Tijmen Blankevoort, Max Welling

    Abstract: We introduce Bayesian Bits, a practical method for joint mixed precision quantization and pruning through gradient based optimization. Bayesian Bits employs a novel decomposition of the quantization operation, which sequentially considers doubling the bit width. At each new bit width, the residual error between the full precision value and the previously rounded value is quantized. We then decide…

    Submitted 27 October, 2020; v1 submitted 14 May, 2020; originally announced May 2020.

  25. arXiv:2004.10568

    cs.LG cs.CV stat.ML

    Up or Down? Adaptive Rounding for Post-Training Quantization

    Authors: Markus Nagel, Rana Ali Amjad, Mart van Baalen, Christos Louizos, Tijmen Blankevoort

    Abstract: When quantizing neural networks, assigning each floating-point weight to its nearest fixed-point value is the predominant approach. We find that, perhaps surprisingly, this is not the best we can do. In this paper, we propose AdaRound, a better weight-rounding mechanism for post-training quantization that adapts to the data and the task loss. AdaRound is fast, does not require fine-tuning of the n…

    Submitted 30 June, 2020; v1 submitted 22 April, 2020; originally announced April 2020.

    Comments: Published as a conference paper at ICML 2020

  26. arXiv:2004.09576

    cs.CV cs.LG stat.ML

    LSQ+: Improving low-bit quantization through learnable offsets and better initialization

    Authors: Yash Bhalgat, Jinwon Lee, Markus Nagel, Tijmen Blankevoort, Nojun Kwak

    Abstract: Unlike ReLU, newer activation functions (like Swish, H-swish, Mish) that are frequently employed in popular efficient architectures can also result in negative activation values, with skewed positive and negative ranges. Typical learnable quantization schemes [PACT, LSQ] assume unsigned quantization for activations and quantize all negative activations to zero which leads to significant loss in pe…

    Submitted 20 April, 2020; originally announced April 2020.

    Comments: Camera-ready for Joint Workshop on Efficient Deep Learning in Computer Vision, CVPR 2020

  27. arXiv:1912.09802

    cs.LG cs.CV stat.ML

    Taxonomy and Evaluation of Structured Compression of Convolutional Neural Networks

    Authors: Andrey Kuzmin, Markus Nagel, Saurabh Pitre, Sandeep Pendyam, Tijmen Blankevoort, Max Welling

    Abstract: The success of deep neural networks in many real-world applications is leading to new challenges in building more efficient architectures. One effective way of making networks more efficient is neural network compression. We provide an overview of existing neural network compression methods that can be used to make neural networks more efficient by changing the architecture of the network. First,…

    Submitted 20 December, 2019; originally announced December 2019.

  28. arXiv:1906.04721

    cs.LG cs.CV stat.ML

    Data-Free Quantization Through Weight Equalization and Bias Correction

    Authors: Markus Nagel, Mart van Baalen, Tijmen Blankevoort, Max Welling

    Abstract: We introduce a data-free quantization method for deep neural networks that does not require fine-tuning or hyperparameter selection. It achieves near-original model performance on common computer vision architectures and tasks. 8-bit fixed-point quantization is essential for efficient inference on modern deep learning hardware. However, quantizing models to run in 8-bit is a non-trivial task, freq…

    Submitted 25 November, 2019; v1 submitted 11 June, 2019; originally announced June 2019.

    Comments: ICCV 2019

    Journal ref: The IEEE International Conference on Computer Vision (ICCV), 2019