
Showing 1–17 of 17 results for author: Ashkboos, S

  1. arXiv:2404.00456

    cs.LG

    QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs

    Authors: Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Martin Jaggi, Dan Alistarh, Torsten Hoefler, James Hensman

    Abstract: We introduce QuaRot, a new Quantization scheme based on Rotations, which is able to quantize LLMs end-to-end, including all weights, activations, and KV cache in 4 bits. QuaRot rotates LLMs in a way that removes outliers from the hidden state without changing the output, making quantization easier. This computational invariance is applied to the hidden state (residual) of the LLM, as well as to th…

    Submitted 30 March, 2024; originally announced April 2024.

    Comments: 19 pages, 6 figures

  2. Arrow Matrix Decomposition: A Novel Approach for Communication-Efficient Sparse Matrix Multiplication

    Authors: Lukas Gianinazzi, Alexandros Nikolaos Ziogas, Langwen Huang, Piotr Luczynski, Saleh Ashkboos, Florian Scheidl, Armon Carigiet, Chio Ge, Nabil Abubaker, Maciej Besta, Tal Ben-Nun, Torsten Hoefler

    Abstract: We propose a novel approach to iterated sparse matrix dense matrix multiplication, a fundamental computational kernel in scientific computing and graph neural network training. In cases where matrix sizes exceed the memory of a single compute node, data transfer becomes a bottleneck. An approach based on dense matrix multiplication algorithms leads to suboptimal scalability and fails to exploit th…

    Submitted 20 March, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

    ACM Class: F.2.1

    Journal ref: PPoPP'24: Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (2024) 404-416

  3. arXiv:2401.15024

    cs.LG cs.CL

    SliceGPT: Compress Large Language Models by Deleting Rows and Columns

    Authors: Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman

    Abstract: Large language models have become the cornerstone of natural language processing, but their use comes with substantial costs in terms of compute and memory resources. Sparsification provides a solution to alleviate these resource constraints, and recent works have shown that trained models can be sparsified post-hoc. Existing sparsification techniques face challenges as they need additional data s…

    Submitted 9 February, 2024; v1 submitted 26 January, 2024; originally announced January 2024.

    Comments: 22 pages, 8 figures, accepted at ICLR24

  4. arXiv:2310.09259

    cs.LG

    QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models

    Authors: Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, Dan Alistarh

    Abstract: Large Language Models (LLMs) from the GPT family have become extremely popular, leading to a race towards reducing their inference costs to allow for efficient local computation. Yet, the vast majority of existing work focuses on weight-only quantization, which can reduce runtime costs in the memory-bound one-token-at-a-time generative setting, but does not address them in compute-bound scenarios,…

    Submitted 2 November, 2023; v1 submitted 13 October, 2023; originally announced October 2023.

    Comments: 16 pages

  5. arXiv:2306.03078

    cs.CL cs.LG

    SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression

    Authors: Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, Dan Alistarh

    Abstract: Recent advances in large language model (LLM) pretraining have led to high-quality LLMs with impressive abilities. By compressing such LLMs via quantization to 3-4 bits per parameter, they can fit into memory-limited devices such as laptops and mobile phones, enabling personalized use. However, quantization down to 3-4 bits per parameter usually leads to moderate-to-high accuracy losses, especiall…

    Submitted 5 June, 2023; originally announced June 2023.

    Comments: Extended preprint

  6. arXiv:2304.07613

    cs.LG

    STen: Productive and Efficient Sparsity in PyTorch

    Authors: Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Saleh Ashkboos, Torsten Hoefler

    Abstract: As deep learning models grow, sparsity is becoming an increasingly critical component of deep neural networks, enabling improved performance and reduced storage. However, existing frameworks offer poor support for sparsity. Specialized sparsity engines focus exclusively on sparse inference, while general frameworks primarily focus on sparse tensors in classical formats and neglect the broader spar…

    Submitted 15 April, 2023; originally announced April 2023.

  7. arXiv:2210.17323

    cs.LG

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Authors: Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh

    Abstract: Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly-accurate GPT models may require multiple performant GPUs, which limits the usability of such models.…

    Submitted 22 March, 2023; v1 submitted 31 October, 2022; originally announced October 2022.

    Comments: ICLR 2023

  8. arXiv:2208.11469

    cs.DC cs.DS

    ProbGraph: High-Performance and High-Accuracy Graph Mining with Probabilistic Set Representations

    Authors: Maciej Besta, Cesare Miglioli, Paolo Sylos Labini, Jakub Tětek, Patrick Iff, Raghavendra Kanakagiri, Saleh Ashkboos, Kacper Janda, Michal Podstawski, Grzegorz Kwasniewski, Niels Gleinig, Flavio Vella, Onur Mutlu, Torsten Hoefler

    Abstract: Important graph mining problems such as Clustering are computationally demanding. To significantly accelerate these problems, we propose ProbGraph: a graph representation that enables simple and fast approximate parallel graph mining with strong theoretical guarantees on work, depth, and result accuracy. The key idea is to represent sets of vertices using probabilistic set representations such as…

    Submitted 21 November, 2022; v1 submitted 24 August, 2022; originally announced August 2022.

    Comments: Best Paper Award at ACM/IEEE Supercomputing'22 (SC22)

    Journal ref: Proceedings of the ACM/IEEE International Conference on High Performance Computing, Networking, Storage and Analysis, November 2022

  9. arXiv:2206.14786

    cs.LG physics.ao-ph

    ENS-10: A Dataset For Post-Processing Ensemble Weather Forecasts

    Authors: Saleh Ashkboos, Langwen Huang, Nikoli Dryden, Tal Ben-Nun, Peter Dueben, Lukas Gianinazzi, Luca Kummer, Torsten Hoefler

    Abstract: Post-processing ensemble prediction systems can improve the reliability of weather forecasting, especially for extreme event prediction. In recent years, different machine learning models have been developed to improve the quality of weather post-processing. However, these models require a comprehensive dataset of weather simulations to produce high-accuracy results, which comes at a high computat…

    Submitted 7 November, 2022; v1 submitted 29 June, 2022; originally announced June 2022.

    Comments: Accepted version of the paper

  10. arXiv:2205.04934

    cs.DS cs.DC

    The spatial computer: A model for energy-efficient parallel computation

    Authors: Lukas Gianinazzi, Tal Ben-Nun, Maciej Besta, Saleh Ashkboos, Yves Baumann, Piotr Luczynski, Torsten Hoefler

    Abstract: We present a new parallel model of computation suitable for spatial architectures, for which the energy used for communication heavily depends on the distance of the communicating processors. In our model, processors have locations on a conceptual two-dimensional grid, and their distance therein determines their communication cost. In particular, we introduce the energy cost of a spatial computati…

    Submitted 17 January, 2023; v1 submitted 10 May, 2022; originally announced May 2022.

    ACM Class: F.2.0

  11. arXiv:2106.15565

    cs.DC cs.AR cs.NI

    Flare: Flexible In-Network Allreduce

    Authors: Daniele De Sensi, Salvatore Di Girolamo, Saleh Ashkboos, Shigang Li, Torsten Hoefler

    Abstract: The allreduce operation is one of the most commonly used communication routines in distributed applications. To improve its bandwidth and to reduce network traffic, this operation can be accelerated by offloading it to network switches that aggregate the data received from the hosts and send the aggregated result back to them. However, existing solutions provide limited customization opportunities…

    Submitted 29 June, 2021; originally announced June 2021.

    ACM Class: C.2.4; C.2.1; B.4.3

    Journal ref: Published in Proceedings of The International Conference for High Performance Computing Networking, Storage, and Analysis (SC '21) (2021)

  12. arXiv:2106.00761

    cs.SI cs.LG

    Motif Prediction with Graph Neural Networks

    Authors: Maciej Besta, Raphael Grob, Cesare Miglioli, Nicola Bernold, Grzegorz Kwasniewski, Gabriel G**i, Raghavendra Kanakagiri, Saleh Ashkboos, Lukas Gianinazzi, Nikoli Dryden, Torsten Hoefler

    Abstract: Link prediction is one of the central problems in graph mining. However, recent studies highlight the importance of higher-order network analysis, where complex structures called motifs are the first-class citizens. We first show that existing link prediction schemes fail to effectively predict motifs. To alleviate this, we establish a general motif prediction problem and we propose several heuris…

    Submitted 21 May, 2022; v1 submitted 26 May, 2021; originally announced June 2021.

    Journal ref: Proceedings of the 28th SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'22), 2022

  13. arXiv:2002.09268

    cs.LG cs.DC stat.ML

    New Bounds For Distributed Mean Estimation and Variance Reduction

    Authors: Peter Davies, Vijaykrishna Gurunathan, Niusha Moshrefi, Saleh Ashkboos, Dan Alistarh

    Abstract: We consider the problem of distributed mean estimation (DME), in which $n$ machines are each given a local $d$-dimensional vector $x_v \in \mathbb{R}^d$, and must cooperate to estimate the mean of their inputs $\mu = \frac{1}{n}\sum_{v=1}^{n} x_v$, while minimizing total communication cost. DME is a fundamental construct in distributed machine learning, and there has been considerable work on variants…

    Submitted 7 April, 2021; v1 submitted 21 February, 2020; originally announced February 2020.

    Comments: 42 pages, 16 figures

  14. arXiv:1802.08021

    cs.DC stat.ML

    SparCML: High-Performance Sparse Communication for Machine Learning

    Authors: Cedric Renggli, Saleh Ashkboos, Mehdi Aghagolzadeh, Dan Alistarh, Torsten Hoefler

    Abstract: Applying machine learning techniques to the quickly growing data in science and industry requires highly-scalable algorithms. Large datasets are most commonly processed in a "data parallel" fashion, distributed across many nodes. Each node's contribution to the overall gradient is summed using a global allreduce. This allreduce is the single communication and thus scalability bottleneck for most machine learnin…

    Submitted 16 August, 2019; v1 submitted 22 February, 2018; originally announced February 2018.

  15. arXiv:1702.05570

    cs.DS

    Multi-way sparsest cut problem on trees with a control on the number of parts and outliers

    Authors: Ramin Javadi, Saleh Ashkboos

    Abstract: Given a graph, the sparsest cut problem asks for a subset of vertices whose edge expansion (the normalized cut given by the subset) is minimized. In this paper, we study a generalization of this problem that seeks $k$ disjoint subsets of vertices (clusters) whose edge expansions are all small and for which, furthermore, the number of vertices remaining outside the subsets (outliers) is also smal…

    Submitted 17 February, 2017; originally announced February 2017.

    Comments: 14 pages

    MSC Class: 05C85; 68Q25; 68R10

  16. arXiv:1702.04739

    cs.DC

    An Efficient Parallel Data Clustering Algorithm Using Isoperimetric Number of Trees

    Authors: Ramin Javadi, Saleh Ashkboos

    Abstract: We propose a parallel graph-based data clustering algorithm using CUDA GPUs, based on exact clustering of the minimum spanning tree in terms of a minimum isoperimetric criterion. We also provide a comparative performance analysis of our algorithm with other related ones which demonstrates the general superiority of this parallel algorithm over other competing algorithms in terms of accuracy and spee…

    Submitted 15 February, 2017; originally announced February 2017.

    Comments: 16 pages, 6 figures

  17. arXiv:1702.01253

    math.CO

    Minimum edge cuts of distance-regular and strongly regular digraphs

    Authors: S. Ashkboos, G. R. Omidi, F. Shafiei, K. Tajbakhsh

    Abstract: In this paper, we show that the edge connectivity of a distance-regular digraph $\Gamma$ with valency $k$ is $k$ and, for $k>2$, any minimum edge cut of $\Gamma$ is the set of all edges going into (or coming out of) a single vertex. Moreover, we show that the same result holds for strongly regular digraphs. These results extend the known results for the undirected case, with quite different proofs.

    Submitted 4 February, 2017; originally announced February 2017.

    Comments: 9 pages, 1 figure
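The computational-invariance idea summarized in the QuaRot abstract (entry 1) can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the QuaRot implementation: it uses a normalized Walsh-Hadamard matrix `H` as one example of an orthogonal rotation (so `H Hᵀ = I`), rotates a hypothetical hidden-state vector by `H`, and folds `Hᵀ` into the following weight matrix. The layer output is unchanged, while a single large outlier channel is spread across all channels. The helper names, sizes and values are hypothetical, and everything else in a transformer block (normalization, attention, KV cache) is ignored.

```python
import math

def hadamard(n):
    """Return the n x n Walsh-Hadamard matrix scaled by 1/sqrt(n) (n a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        # Sylvester construction: [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

# Hypothetical hidden state (one token, 4 channels) with one outlier channel.
x = [[0.1, -0.2, 8.0, 0.3]]
# Hypothetical weights of the next linear layer (4 inputs -> 2 outputs).
w = [[0.5, -1.0],
     [0.25, 0.75],
     [-0.5, 0.1],
     [1.0, 0.2]]

h = hadamard(4)
x_rot = matmul(x, h)              # rotated hidden state: x H
w_rot = matmul(transpose(h), w)   # rotation folded into the weights: H^T W

y = matmul(x, w)                  # original output: x W
y_rot = matmul(x_rot, w_rot)      # (x H)(H^T W) = x W, up to rounding

print("max |x|      :", max(abs(v) for v in x[0]))
print("max |x H|    :", max(abs(v) for v in x_rot[0]))
print("outputs match:", all(abs(a - b) < 1e-9 for a, b in zip(y[0], y_rot[0])))
```

In this 4-channel toy the largest magnitude drops from 8.0 to roughly 4.2 while the outputs agree to floating-point precision; a smaller dynamic range is what makes low-bit quantization of the rotated activations easier.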
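Entries 11 (Flare) and 14 (SparCML) both concern allreduce, after which every participant holds the element-wise sum of all participants' vectors. The sketch below only illustrates those semantics for sparse gradients. It assumes a hypothetical representation of each sparse vector as a Python dict mapping index to value, and it uses a single aggregation point, loosely in the spirit of the switch-based aggregation described in Flare's abstract. It models no network and is not the algorithm of either paper.

```python
from collections import defaultdict

def sparse_allreduce(contributions):
    """Sum sparse vectors (dict: index -> value) from every node and
    return one copy of the aggregate per node, as an allreduce would."""
    total = defaultdict(float)
    # Aggregation step (e.g. performed at a switch or a root node).
    for vec in contributions:
        for idx, val in vec.items():
            total[idx] += val
    result = dict(total)
    # Broadcast step: each node receives the same aggregated vector.
    return [dict(result) for _ in contributions]

# Hypothetical sparse gradients from three nodes.
grads = [
    {0: 1.0, 5: -2.0},
    {5: 0.5, 7: 3.0},
    {0: -1.0, 2: 4.0},
]
out = sparse_allreduce(grads)
print(out[0])  # {0: 0.0, 5: -1.5, 7: 3.0, 2: 4.0}
```

Note that index 0 cancels to 0.0 but is still stored; the sketch makes no attempt to keep the aggregate sparse, which is one of the issues a real sparse-communication scheme has to handle.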
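The sparsest cut problem from entry 15 can be made concrete on a tiny graph. This sketch assumes the common definition of edge expansion, the number of edges leaving a subset S divided by |S|, restricted to nonempty subsets with |S| ≤ n/2; the paper may normalize differently. It solves only the single-subset problem by brute force (exponential time), not the paper's multi-way variant with outliers, and the function names and example graph are hypothetical.

```python
from itertools import combinations

def edge_expansion(edges, subset):
    """Number of edges with exactly one endpoint in `subset`, divided by its size."""
    s = set(subset)
    boundary = sum(1 for u, v in edges if (u in s) != (v in s))
    return boundary / len(s)

def sparsest_cut_bruteforce(n, edges):
    """Exhaustively find a nonempty subset S with |S| <= n/2 that minimizes
    edge expansion. Exponential time: only for tiny illustrative graphs."""
    best = None
    for size in range(1, n // 2 + 1):
        for subset in combinations(range(n), size):
            phi = edge_expansion(edges, subset)
            if best is None or phi < best[0]:
                best = (phi, subset)
    return best

# Hypothetical path graph 0-1-2-3-4-5 (a tree, matching the paper's setting).
path_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
print(sparsest_cut_bruteforce(6, path_edges))  # (0.3333333333333333, (0, 1, 2))
```

On a path, cutting the single middle edge leaves three vertices on each side, so the best subset has one boundary edge and expansion 1/3.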