Search | arXiv e-print repository

QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead

Authors: Amir Zandieh, Majid Daliri, Insu Han

Abstract: Serving LLMs requires substantial memory due to the storage requirements of Key-Value (KV) embeddings in the KV cache, which grows with sequence length. An effective approach to compress KV cache is quantization. However, traditional quantization methods face significant memory overhead due to the need to store quantization constants (at least a zero point and a scale) in full precision per data block. Depending on the block size, this overhead can add 1 or 2 bits per quantized number. We introduce QJL, a new quantization approach that consists of a Johnson-Lindenstrauss (JL) transform followed by sign-bit quantization. In contrast to existing methods, QJL eliminates memory overheads by removing the need for storing quantization constants. We propose an asymmetric estimator for the inner product of two vectors and demonstrate that applying QJL to one vector and a standard JL transform without quantization to the other provides an unbiased estimator with minimal distortion. We have developed an efficient implementation of the QJL sketch and its corresponding inner product estimator, incorporating a lightweight CUDA kernel for optimized computation. When applied across various LLMs and NLP tasks to quantize the KV cache to only 3 bits, QJL demonstrates a more than fivefold reduction in KV cache memory usage without compromising accuracy, all while achieving faster runtime. Code is available at https://github.com/amirzandieh/QJL.

Submitted 5 June, 2024; originally announced June 2024.

Comments: 13 pages
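
Illustration (not part of the listing): the QJL abstract above describes a JL transform followed by sign-bit quantization on one vector, paired with an unquantized JL transform on the other. The following is a minimal pure-Python sketch of that idea, not the authors' CUDA implementation. It rests on assumptions the abstract does not state: a projection matrix with i.i.d. standard Gaussian entries, the sqrt(pi/2)/m scaling, and keeping the key's norm alongside its sign bits. All function names are hypothetical.

```python
import math
import random

def gaussian_matrix(m, d, rng):
    """m x d matrix with i.i.d. N(0, 1) entries (assumed form of the JL transform)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def qjl_sketch(S, key):
    """Quantized side: keep only the sign of each projection of the key.

    The key's Euclidean norm is returned as well; the estimator below needs it
    (an assumption of this sketch, not something the abstract spells out).
    """
    signs = [1 if dot(row, key) >= 0.0 else -1 for row in S]
    return signs, math.sqrt(dot(key, key))

def estimate_inner_product(S, query, signs, key_norm):
    """Asymmetric estimate of <query, key>.

    The query is projected without quantization and paired with the key's sign
    bits. For a Gaussian row s, E[(s.q) * sign(s.k)] = sqrt(2/pi) * <q, k> / ||k||,
    so rescaling by sqrt(pi/2) * ||k|| and averaging over the m rows is unbiased.
    """
    m = len(S)
    acc = sum(dot(row, query) * b for row, b in zip(S, signs))
    return math.sqrt(math.pi / 2.0) * key_norm * acc / m

# Small demonstration with a fixed seed.
rng = random.Random(0)
d, m = 8, 20000
S = gaussian_matrix(m, d, rng)
k = [rng.gauss(0.0, 1.0) for _ in range(d)]
q = [rng.gauss(0.0, 1.0) for _ in range(d)]

signs, k_norm = qjl_sketch(S, k)
approx = estimate_inner_product(S, q, signs, k_norm)
exact = dot(q, k)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

In this sketch the error shrinks roughly like 1/sqrt(m), so the large m here is chosen only to make the agreement visible, not as a recommended sketch size.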

arXiv:2309.16157 [pdf, other]

Sampling Methods for Inner Product Sketching

Authors: Majid Daliri, Juliana Freire, Christopher Musco, Aécio Santos, Haoxiang Zhang

Abstract: Recently, Bessa et al. (PODS 2023) showed that sketches based on coordinated weighted sampling theoretically and empirically outperform popular linear sketching methods like Johnson-Lindenstrauss projection and CountSketch for the ubiquitous problem of inner product estimation. We further develop this finding by introducing and analyzing two alternative sampling-based methods. In contrast to the computationally expensive algorithm in Bessa et al., our methods run in linear time (to compute the sketch) and perform better in practice, significantly beating linear sketching on a variety of tasks. For example, they provide state-of-the-art results for estimating the correlation between columns in unjoined tables, a problem that we show how to reduce to inner product estimation in a black-box way. While based on known sampling techniques (threshold and priority sampling), we introduce significant new theoretical analysis to prove approximation guarantees for our methods.

Submitted 15 January, 2024; v1 submitted 28 September, 2023; originally announced September 2023.

Comments: 17 pages, 10 figures

arXiv:2308.05907 [pdf, ps, other]

Simple Analysis of Priority Sampling

Authors: Majid Daliri, Juliana Freire, Christopher Musco, Aécio Santos, Haoxiang Zhang

Abstract: We prove a tight upper bound on the variance of the priority sampling method (aka sequential Poisson sampling). Our proof is significantly shorter and simpler than the original proof given by Mario Szegedy at STOC 2006, which resolved a conjecture by Duffield, Lund, and Thorup.

Submitted 10 August, 2023; originally announced August 2023.

Comments: 6 pages

arXiv:2302.02451 [pdf, other]

KDEformer: Accelerating Transformers via Kernel Density Estimation

Authors: Amir Zandieh, Insu Han, Majid Daliri, Amin Karbasi

Abstract: The dot-product attention mechanism plays a crucial role in modern deep architectures (e.g., Transformer) for sequence modeling; however, naïve exact computation of this model incurs quadratic time and memory complexities in sequence length, hindering the training of long-sequence models. Critical bottlenecks are due to the computation of partition functions in the denominator of the softmax function as well as the multiplication of the softmax matrix with the matrix of values. Our key observation is that the former can be reduced to a variant of the kernel density estimation (KDE) problem, and an efficient KDE solver can be further utilized to accelerate the latter via subsampling-based fast matrix products. Our proposed KDEformer can approximate the attention in sub-quadratic time with provable spectral norm bounds, while all prior results merely provide entry-wise error bounds. Empirically, we verify that KDEformer outperforms other attention approximations in terms of accuracy, memory, and runtime on various pre-trained models. On BigGAN image generation, we achieve better generative scores than the exact computation with over $4\times$ speedup. For ImageNet classification with T2T-ViT, KDEformer shows over $18\times$ speedup while the accuracy drop is less than $0.5\%$.

Submitted 29 June, 2023; v1 submitted 5 February, 2023; originally announced February 2023.

Comments: 26 pages, 7 figures

arXiv:2301.05811 [pdf, other]

Weighted Minwise Hashing Beats Linear Sketching for Inner Product Estimation

Authors: Aline Bessa, Majid Daliri, Juliana Freire, Cameron Musco, Christopher Musco, Aécio Santos, Haoxiang Zhang

Abstract: We present a new approach for computing compact sketches that can be used to approximate the inner product between pairs of high-dimensional vectors. Based on the Weighted MinHash algorithm, our approach admits strong accuracy guarantees that improve on the guarantees of popular linear sketching approaches for inner product estimation, such as CountSketch and Johnson-Lindenstrauss projection. Specifically, while our method admits guarantees that exactly match linear sketching for dense vectors, it yields significantly lower error for sparse vectors with limited overlap between non-zero entries. Such vectors arise in many applications involving sparse data. They are also important in increasingly popular dataset search applications, where inner product sketches are used to estimate data covariance, conditional means, and other quantities involving columns in unjoined tables. We complement our theoretical results by showing that our approach empirically outperforms existing linear sketches and unweighted hashing-based sketches for sparse vectors.

Submitted 5 May, 2023; v1 submitted 13 January, 2023; originally announced January 2023.

Comments: 23 pages, 6 figures

Journal ref: In Proceedings of the ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS) 2023

arXiv:1901.02943 [pdf]

doi 10.22041/IJBME.2017.72949.1276

Brain Electrical Stimulation for Animal Navigation

Authors: Amirmasoud Ahmadi, Sepideh Farakhor Seghinsara, Mohammad Reza Daliri, Vahid Shalchyan

Abstract: Brain stimulation and its widespread use is one of the most important subjects in neurophysiology studies. In brain electrical stimulation methods, following surgery and electrode implantation, electrodes send electrical impulses to specific targets in the brain. This stimulation method has provided therapeutic benefits in the treatment of chronic pain, essential tremor, Parkinson's disease, major depression, and neurological movement disorder syndrome (dystonia). One area in which advancements have recently been made is controlling the movement and navigation of animals along a specific pathway. It is important to identify brain targets in order to stimulate appropriate brain regions for all the applications listed above. An animal navigation system based on brain electrical stimulation is used to develop new behavioral models, with the aim of creating a platform for interacting with the animal nervous system in spatial learning tasks. In the context of animal navigation, electrical stimulation has been used either to create a virtual sensation for movement guidance or a virtual reward for movement motivation. In this paper, different approaches and techniques of brain electrical stimulation for this application are reviewed. Keywords: Rat Robot, Brain Computer Interface, Electrical Stimulation, Cyborg Intelligence, Brain to Brain Interface

Submitted 1 December, 2018; originally announced January 2019.

Comments: in Farsi

Journal ref: Iranian Journal of Biomedical Engineering, 11(1), pp. 83-100

arXiv:1805.04703 [pdf]

doi 10.1109/EBBT.2017.7956756

A New Method for Epileptic Seizure Classification in EEG Using Adapted Wavelet Packets

Authors: Amirmasoud Ahmadi, Vahid Shalchyan, Mohammad Reza Daliri

Abstract: Electroencephalography (EEG), as the most common tool for epileptic seizure classification, contains useful information about different physiological states of the brain. Seizure-related features in EEG signals can be better identified when localized in time-frequency basis projections. In this work, a novel method for epileptic seizure classification based on wavelet packets (WPs) is presented in which both the mother wavelet function and the WP bases are adapted a posteriori to improve the seizure classification. A support vector machine (SVM) is used as the classifier for seizure versus non-seizure EEG segment classification. In order to evaluate the proposed algorithm, a publicly available dataset containing different groups of patients with epilepsy and healthy individuals is used. The obtained results indicate that the proposed method outperforms some previously proposed algorithms in epileptic seizure classification.

Submitted 12 May, 2018; originally announced May 2018.

Comments: Electroencephalography, Wavelet packets transform (WPT), Support vector machines (SVMs), Electric Electronics, Computer Science, Biomedical Engineerings' Meeting (EBBT), 2017

arXiv:1805.01743 [pdf]

doi 10.1109/EBBT.2018.8391471

Classification of Epileptic EEG Signals by Wavelet based CFC

Authors: Amirmasoud Ahmadi, Mahsa Behroozi, Vahid Shalchyan, Mohammad Reza Daliri

Abstract: The electroencephalogram, an influential tool for analyzing human activities and recognizing seizure attacks, can play a crucial role in designing accurate systems which can distinguish ictal seizures from regular brain alertness, since it is the first step towards accomplishing a high-accuracy computer-aided diagnosis (CAD) system. In this article, a novel approach for classification of ictal signals with wavelet-based cross frequency coupling (CFC) is suggested. After features are extracted by wavelet-based CFC, optimal features are selected by t-test, and quadratic discriminant analysis (QDA) performs the classification.

Submitted 4 May, 2018; originally announced May 2018.

Comments: Electroencephalogram; Wavelet Decomposition; Cross Frequency Coupling; Quadratic Discriminant Analysis; T-test Feature Selection

Journal ref: Electrical-Electronics & Biomedical Engineering and Computer Science in 2018 (EBBT 2018)

arXiv:1411.1257

Low Frequency LFP in Macaque MT Predicts Reaction Time in an Attentive Task

Authors: Kourosh Maboudi, Moein Esghaei, Mohammad Reza Daliri

Abstract: Neural oscillations are related to a wide variety of cognitive functions, including attention. However, there is still a controversy over the frequency bands that have functional roles in attention. In this study, using a spatial attention task we found that phase of low frequency oscillations could predict the reaction time of the monkey, when the monkey is attending to the target stimulus as opposed to attending a distractor. This finding provides strong evidence for the functional role of low frequency bands in attentional modulation of neural activities.

Submitted 9 November, 2014; v1 submitted 5 November, 2014; originally announced November 2014.

Comments: 20 pages, 4 figures. The paper has been withdrawn by the author due to misinterpretation of the output of the classification algorithm; as a result, the produced figures have some major problems and need more investigation.

Showing 1–9 of 9 results for author: Daliri, M