-
QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead
Authors:
Amir Zandieh,
Majid Daliri,
Insu Han
Abstract:
Serving LLMs requires substantial memory due to the storage requirements of Key-Value (KV) embeddings in the KV cache, which grows with sequence length. An effective approach to compressing the KV cache is quantization. However, traditional quantization methods face significant memory overhead due to the need to store quantization constants (at least a zero point and a scale) in full precision per data block. Depending on the block size, this overhead can add 1 or 2 bits per quantized number. We introduce QJL, a new quantization approach that consists of a Johnson-Lindenstrauss (JL) transform followed by sign-bit quantization. In contrast to existing methods, QJL eliminates memory overheads by removing the need for storing quantization constants. We propose an asymmetric estimator for the inner product of two vectors and demonstrate that applying QJL to one vector and a standard JL transform without quantization to the other provides an unbiased estimator with minimal distortion. We have developed an efficient implementation of the QJL sketch and its corresponding inner product estimator, incorporating a lightweight CUDA kernel for optimized computation. When applied across various LLMs and NLP tasks to quantize the KV cache to only 3 bits, QJL demonstrates a more than fivefold reduction in KV cache memory usage without compromising accuracy, all while achieving faster runtime. Code is available at https://github.com/amirzandieh/QJL.
Submitted 5 June, 2024;
originally announced June 2024.
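The asymmetric estimator described in the abstract above can be illustrated with a minimal pure-Python sketch. It is not the authors' CUDA implementation. It assumes a Gaussian JL matrix `S`, and it assumes the key's Euclidean norm is kept next to its sign bits so the estimate can be rescaled. All function names here are hypothetical.

```python
import math
import random

def gaussian_matrix(m, d, rng):
    """Draw an m x d JL matrix with i.i.d. standard normal entries (assumed choice)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def qjl_sketch_key(S, k):
    """Quantized side: keep only the sign of each projection, plus the key norm."""
    signs = [1 if dot(row, k) >= 0 else -1 for row in S]
    return signs, math.sqrt(dot(k, k))

def jl_project_query(S, q):
    """Unquantized side: a plain JL projection of the query."""
    return [dot(row, q) for row in S]

def estimate_inner_product(Sq, signs, key_norm):
    """Rescale <Sq, sign(Sk)> so that its expectation equals <q, k>.

    For a Gaussian row s, E[<s, q> * sign(<s, k>)] = sqrt(2/pi) * <q, k> / ||k||,
    which is where the sqrt(pi/2) * ||k|| / m factor comes from.
    """
    m = len(signs)
    return math.sqrt(math.pi / 2) * key_norm / m * dot(Sq, signs)

# Usage: compare the estimate with the exact inner product.
rng = random.Random(0)
d, m = 16, 4096
S = gaussian_matrix(m, d, rng)
k = [rng.gauss(0.0, 1.0) for _ in range(d)]
q = [rng.gauss(0.0, 1.0) for _ in range(d)]
signs, key_norm = qjl_sketch_key(S, k)
estimate = estimate_inner_product(jl_project_query(S, q), signs, key_norm)
exact = dot(q, k)
print(f"exact={exact:.4f} estimate={estimate:.4f}")
```

The sketch dimension `m` is deliberately large here so the estimate is visibly close to the exact value; the error shrinks roughly as one over the square root of `m`.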
-
Sampling Methods for Inner Product Sketching
Authors:
Majid Daliri,
Juliana Freire,
Christopher Musco,
Aécio Santos,
Haoxiang Zhang
Abstract:
Recently, Bessa et al. (PODS 2023) showed that sketches based on coordinated weighted sampling theoretically and empirically outperform popular linear sketching methods like Johnson-Lindenstrauss projection and CountSketch for the ubiquitous problem of inner product estimation. We further develop this finding by introducing and analyzing two alternative sampling-based methods. In contrast to the computationally expensive algorithm in Bessa et al., our methods run in linear time (to compute the sketch) and perform better in practice, significantly beating linear sketching on a variety of tasks. For example, they provide state-of-the-art results for estimating the correlation between columns in unjoined tables, a problem that we show how to reduce to inner product estimation in a black-box way. While based on known sampling techniques (threshold and priority sampling), we introduce significant new theoretical analysis to prove approximation guarantees for our methods.
Submitted 15 January, 2024; v1 submitted 28 September, 2023;
originally announced September 2023.
-
Simple Analysis of Priority Sampling
Authors:
Majid Daliri,
Juliana Freire,
Christopher Musco,
Aécio Santos,
Haoxiang Zhang
Abstract:
We prove a tight upper bound on the variance of the priority sampling method (aka sequential Poisson sampling). Our proof is significantly shorter and simpler than the original proof given by Mario Szegedy at STOC 2006, which resolved a conjecture by Duffield, Lund, and Thorup.
Submitted 10 August, 2023;
originally announced August 2023.
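For readers unfamiliar with the priority sampling method named in the abstract above, here is a minimal sketch of the standard scheme of Duffield, Lund, and Thorup, used to estimate a sum of weights. It illustrates the sampler only, not the paper's variance analysis. The function name `priority_sample` and the example weights are hypothetical, and weights are assumed to be non-negative.

```python
import random

def priority_sample(weights, k, rng):
    """Return {index: adjusted_weight} for a priority sample of size k.

    Each item gets priority w / u with u uniform in (0, 1]. The k items with
    the highest priority are kept, and tau is the (k+1)-th highest priority.
    Each kept item is assigned max(w, tau), so the adjusted weights sum to an
    unbiased estimate of the total weight.
    """
    if k >= len(weights):
        # Nothing needs to be dropped, so the sample is exact.
        return {i: w for i, w in enumerate(weights)}
    # 1 - rng.random() lies in (0, 1], which avoids division by zero.
    priorities = [(w / (1.0 - rng.random()), i) for i, w in enumerate(weights)]
    priorities.sort(reverse=True)
    tau = priorities[k][0]
    return {i: max(weights[i], tau) for _, i in priorities[:k]}

# Usage: estimate the total weight of 50 items from a sample of 10.
rng = random.Random(1)
weights = [float(i + 1) for i in range(50)]
sample = priority_sample(weights, 10, rng)
print(f"true total={sum(weights):.1f} estimate={sum(sample.values()):.1f}")
```

A single estimate can be far from the true total; the unbiasedness shows up when the estimate is averaged over many independent samples.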
-
KDEformer: Accelerating Transformers via Kernel Density Estimation
Authors:
Amir Zandieh,
Insu Han,
Majid Daliri,
Amin Karbasi
Abstract:
The dot-product attention mechanism plays a crucial role in modern deep architectures (e.g., Transformer) for sequence modeling; however, naïve exact computation of this model incurs quadratic time and memory complexities in sequence length, hindering the training of long-sequence models. Critical bottlenecks are due to the computation of partition functions in the denominator of the softmax function as well as the multiplication of the softmax matrix with the matrix of values. Our key observation is that the former can be reduced to a variant of the kernel density estimation (KDE) problem, and an efficient KDE solver can be further utilized to accelerate the latter via subsampling-based fast matrix products. Our proposed KDEformer can approximate the attention in sub-quadratic time with provable spectral norm bounds, while all prior results merely provide entry-wise error bounds. Empirically, we verify that KDEformer outperforms other attention approximations in terms of accuracy, memory, and runtime on various pre-trained models. On BigGAN image generation, we achieve better generative scores than the exact computation with over $4\times$ speedup. For ImageNet classification with T2T-ViT, KDEformer shows over $18\times$ speedup while the accuracy drop is less than $0.5\%$.
Submitted 29 June, 2023; v1 submitted 5 February, 2023;
originally announced February 2023.
-
Weighted Minwise Hashing Beats Linear Sketching for Inner Product Estimation
Authors:
Aline Bessa,
Majid Daliri,
Juliana Freire,
Cameron Musco,
Christopher Musco,
Aécio Santos,
Haoxiang Zhang
Abstract:
We present a new approach for computing compact sketches that can be used to approximate the inner product between pairs of high-dimensional vectors. Based on the Weighted MinHash algorithm, our approach admits strong accuracy guarantees that improve on the guarantees of popular linear sketching approaches for inner product estimation, such as CountSketch and Johnson-Lindenstrauss projection. Specifically, while our method admits guarantees that exactly match linear sketching for dense vectors, it yields significantly lower error for sparse vectors with limited overlap between non-zero entries. Such vectors arise in many applications involving sparse data. They are also important in increasingly popular dataset search applications, where inner product sketches are used to estimate data covariance, conditional means, and other quantities involving columns in unjoined tables. We complement our theoretical results by showing that our approach empirically outperforms existing linear sketches and unweighted hashing-based sketches for sparse vectors.
Submitted 5 May, 2023; v1 submitted 13 January, 2023;
originally announced January 2023.
-
Brain Electrical Stimulation for Animal Navigation
Authors:
Amirmasoud Ahmadi,
Sepideh Farakhor Seghinsara,
Mohammad Reza Daliri,
Vahid Shalchyan
Abstract:
Brain stimulation and its widespread use is one of the most important subjects in studies of neurophysiology. In brain electrical stimulation methods, following surgery and electrode implantation, electrodes send electrical impulses to specific targets in the brain. This stimulation method has provided therapeutic benefits in the treatment of chronic pain, essential tremor, Parkinson's disease, major depression, and neurological movement disorder syndrome (dystonia). One area in which advancements have recently been made is controlling the movement and navigation of animals along a specific pathway. It is important to identify brain targets in order to stimulate appropriate brain regions for all the applications listed above. An animal navigation system based on brain electrical stimulation is used to develop new behavioral models, with the aim of creating a platform for interacting with the animal nervous system in a spatial learning task. In the context of animal navigation, electrical stimulation has been used either to create a virtual sensation for movement guidance or a virtual reward for movement motivation. In this paper, different approaches and techniques of brain electrical stimulation for this application are reviewed.
Keywords: Rat Robot, Brain Computer Interface, Electrical Stimulation, Cyborg Intelligence, Brain to Brain Interface
Submitted 1 December, 2018;
originally announced January 2019.
-
A New Method for Epileptic Seizure Classification in EEG Using Adapted Wavelet Packets
Authors:
Amirmasoud Ahmadi,
Vahid Shalchyan,
Mohammad Reza Daliri
Abstract:
Electroencephalography (EEG), as the most common tool for epileptic seizure classification, contains useful information about different physiological states of the brain. Seizure-related features in EEG signals can be better identified when localized in time-frequency basis projections. In this work, a novel method for epileptic seizure classification based on wavelet packets (WPs) is presented, in which both the mother wavelet function and the WP bases are adapted a posteriori to improve the seizure classification. A support vector machine (SVM) is used as the classifier for seizure versus non-seizure EEG segment classification. To evaluate the proposed algorithm, a publicly available dataset containing different groups of patients with epilepsy and healthy individuals is used. The obtained results indicate that the proposed method outperforms some previously proposed algorithms in epileptic seizure classification.
Submitted 12 May, 2018;
originally announced May 2018.
-
Classification of Epileptic EEG Signals by Wavelet based CFC
Authors:
Amirmasoud Ahmadi,
Mahsa Behroozi,
Vahid Shalchyan,
Mohammad Reza Daliri
Abstract:
The electroencephalogram, an influential tool for analyzing human activities and recognizing seizure attacks, can play a crucial role in designing accurate systems that distinguish ictal seizures from regular brain alertness, which is the first step towards a high-accuracy computer-aided diagnosis (CAD) system. In this article, a novel approach for the classification of ictal signals with wavelet-based cross-frequency coupling (CFC) is suggested. After features are extracted by wavelet-based CFC, optimal features are selected by t-test, and quadratic discriminant analysis (QDA) completes the classification.
Submitted 4 May, 2018;
originally announced May 2018.
-
Low Frequency LFP in Macaque MT Predicts Reaction Time in an Attentive Task
Authors:
Kourosh Maboudi,
Moein Esghaei,
Mohammad Reza Daliri
Abstract:
Neural oscillations are related to a wide variety of cognitive functions, including attention. However, there is still controversy over which frequency bands have functional roles in attention. In this study, using a spatial attention task, we found that the phase of low-frequency oscillations could predict the reaction time of the monkey when the monkey is attending to the target stimulus as opposed to attending to a distractor. This finding provides strong evidence for the functional role of low-frequency bands in attentional modulation of neural activities.
Submitted 9 November, 2014; v1 submitted 5 November, 2014;
originally announced November 2014.