-
QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead
Authors:
Amir Zandieh,
Majid Daliri,
Insu Han
Abstract:
Serving LLMs requires substantial memory due to the storage requirements of Key-Value (KV) embeddings in the KV cache, which grows with sequence length. An effective approach to compress KV cache is quantization. However, traditional quantization methods face significant memory overhead due to the need to store quantization constants (at least a zero point and a scale) in full precision per data block. Depending on the block size, this overhead can add 1 or 2 bits per quantized number. We introduce QJL, a new quantization approach that consists of a Johnson-Lindenstrauss (JL) transform followed by sign-bit quantization. In contrast to existing methods, QJL eliminates memory overheads by removing the need for storing quantization constants. We propose an asymmetric estimator for the inner product of two vectors and demonstrate that applying QJL to one vector and a standard JL transform without quantization to the other provides an unbiased estimator with minimal distortion. We have developed an efficient implementation of the QJL sketch and its corresponding inner product estimator, incorporating a lightweight CUDA kernel for optimized computation. When applied across various LLMs and NLP tasks to quantize the KV cache to only 3 bits, QJL demonstrates a more than fivefold reduction in KV cache memory usage without compromising accuracy, all while achieving faster runtime. Code is available at https://github.com/amirzandieh/QJL.
Submitted 5 June, 2024;
originally announced June 2024.
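The asymmetric estimator described in the QJL abstract above can be illustrated with a small pure-Python sketch. It is not the authors' implementation (see their repository for that). The function names, the dense Gaussian matrix `S`, the sizes `d` and `m`, and the choice to keep the key's norm as one extra scalar are all assumptions made for this illustration. The scaling constant follows from the standard identity E[(s·q) sign(s·k)] = sqrt(2/π) (q·k)/‖k‖ for a standard Gaussian vector s.

```python
import math
import random


def dot(u, v):
    """Plain inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    """Rescale v to unit Euclidean norm."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def gaussian_matrix(m, d, rng):
    """m x d matrix with i.i.d. N(0, 1) entries (the JL transform, assumed dense here)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]


def qjl_sketch(S, key):
    """Quantized side: keep only the sign bit of each projection, plus the key norm."""
    signs = [1 if dot(row, key) >= 0 else -1 for row in S]
    return signs, math.sqrt(dot(key, key))


def estimate_inner_product(S, query, signs, key_norm):
    """Unquantized side: project the query with S and combine with the stored signs."""
    m = len(S)
    acc = sum(dot(row, query) * s for row, s in zip(S, signs))
    return math.sqrt(math.pi / 2) / m * key_norm * acc


rng = random.Random(0)
d, m = 16, 4000  # hypothetical sizes, chosen only to keep the demo fast
key = unit([rng.gauss(0.0, 1.0) for _ in range(d)])
query = unit([rng.gauss(0.0, 1.0) for _ in range(d)])

S = gaussian_matrix(m, d, rng)
signs, key_norm = qjl_sketch(S, key)
estimate = estimate_inner_product(S, query, signs, key_norm)
exact = dot(query, key)
print(f"exact={exact:.4f} estimate={estimate:.4f}")
```

Only the sign bits and one norm are kept for the key, while the query is projected without quantization, which is the asymmetry the abstract refers to. The estimate concentrates around the exact inner product as `m` grows.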
-
SubGen: Token Generation in Sublinear Time and Memory
Authors:
Amir Zandieh,
Insu Han,
Vahab Mirrokni,
Amin Karbasi
Abstract:
Despite the significant success of large language models (LLMs), their extensive memory requirements pose challenges for deploying them in long-context token generation. The substantial memory footprint of LLM decoders arises from the necessity to store all previous tokens in the attention module, a requirement imposed by key-value (KV) caching. In this work, our focus is on developing an efficient compression technique for the KV cache. Empirical evidence indicates a significant clustering tendency within key embeddings in the attention module. Building on this key insight, we have devised a novel caching method with sublinear complexity, employing online clustering on key tokens and online $\ell_2$ sampling on values. The result is a provably accurate and efficient attention decoding algorithm, termed SubGen. Not only does this algorithm ensure a sublinear memory footprint and sublinear time complexity, but we also establish a tight error bound for our approach. Empirical evaluations on long-context question-answering tasks demonstrate that SubGen significantly outperforms existing and state-of-the-art KV cache compression methods in terms of performance and efficiency.
Submitted 8 February, 2024;
originally announced February 2024.
-
HyperAttention: Long-context Attention in Near-Linear Time
Authors:
Insu Han,
Rajesh Jayaram,
Amin Karbasi,
Vahab Mirrokni,
David P. Woodruff,
Amir Zandieh
Abstract:
We present an approximate attention mechanism named HyperAttention to address the computational challenges posed by the growing complexity of long contexts used in Large Language Models (LLMs). Recent work suggests that in the worst-case scenario, quadratic time is necessary unless the entries of the attention matrix are bounded or the matrix has low stable rank. We introduce two parameters which measure: (1) the max column norm in the normalized attention matrix, and (2) the ratio of row norms in the unnormalized attention matrix after detecting and removing large entries. We use these fine-grained parameters to capture the hardness of the problem. Despite previous lower bounds, we are able to achieve a linear time sampling algorithm even when the matrix has unbounded entries or a large stable rank, provided the above parameters are small. HyperAttention features a modular design that easily accommodates integration of other fast low-level implementations, particularly FlashAttention. Empirically, employing Locality Sensitive Hashing (LSH) to identify large entries, HyperAttention outperforms existing methods, giving significant speed improvements compared to state-of-the-art solutions like FlashAttention. We validate the empirical performance of HyperAttention on a variety of different long-context length datasets. For example, HyperAttention makes the inference time of ChatGLM2 50\% faster on 32k context length while perplexity increases from 5.6 to 6.3. On larger context length, e.g., 131k, with causal masking, HyperAttention offers 5-fold speedup on a single attention layer.
Submitted 1 December, 2023; v1 submitted 9 October, 2023;
originally announced October 2023.
-
KDEformer: Accelerating Transformers via Kernel Density Estimation
Authors:
Amir Zandieh,
Insu Han,
Majid Daliri,
Amin Karbasi
Abstract:
Dot-product attention mechanism plays a crucial role in modern deep architectures (e.g., Transformer) for sequence modeling, however, naïve exact computation of this model incurs quadratic time and memory complexities in sequence length, hindering the training of long-sequence models. Critical bottlenecks are due to the computation of partition functions in the denominator of softmax function as well as the multiplication of the softmax matrix with the matrix of values. Our key observation is that the former can be reduced to a variant of the kernel density estimation (KDE) problem, and an efficient KDE solver can be further utilized to accelerate the latter via subsampling-based fast matrix products. Our proposed KDEformer can approximate the attention in sub-quadratic time with provable spectral norm bounds, while all prior results merely provide entry-wise error bounds. Empirically, we verify that KDEformer outperforms other attention approximations in terms of accuracy, memory, and runtime on various pre-trained models. On BigGAN image generation, we achieve better generative scores than the exact computation with over $4\times$ speedup. For ImageNet classification with T2T-ViT, KDEformer shows over $18\times$ speedup while the accuracy drop is less than $0.5\%$.
Submitted 29 June, 2023; v1 submitted 5 February, 2023;
originally announced February 2023.
-
Fast Neural Kernel Embeddings for General Activations
Authors:
Insu Han,
Amir Zandieh,
Jaehoon Lee,
Roman Novak,
Lechao Xiao,
Amin Karbasi
Abstract:
Infinite width limit has shed light on generalization and optimization aspects of deep learning by establishing connections between neural networks and kernel methods. Despite their importance, the utility of these kernel methods was limited in large-scale learning settings due to their (super-)quadratic runtime and memory complexities. Moreover, most prior works on neural kernels have focused on the ReLU activation, mainly due to its popularity but also due to the difficulty of computing such kernels for general activations. In this work, we overcome such difficulties by providing methods to work with general activations. First, we compile and expand the list of activation functions admitting exact dual activation expressions to compute neural kernels. When the exact computation is unknown, we present methods to effectively approximate them. We propose a fast sketching method that approximates any multi-layered Neural Network Gaussian Process (NNGP) kernel and Neural Tangent Kernel (NTK) matrices for a wide range of activation functions, going beyond the commonly analyzed ReLU activation. This is done by showing how to approximate the neural kernels using the truncated Hermite expansion of any desired activation functions. While most prior works require data points on the unit sphere, our methods do not suffer from such limitations and are applicable to any dataset of points in $\mathbb{R}^d$. Furthermore, we provide a subspace embedding for NNGP and NTK matrices with near input-sparsity runtime and near-optimal target dimension which applies to any \emph{homogeneous} dual activation functions with rapidly convergent Taylor expansion. Empirically, with respect to exact convolutional NTK (CNTK) computation, our method achieves $106\times$ speedup for approximate CNTK of a 5-layer Myrtle network on CIFAR-10 dataset.
Submitted 9 September, 2022;
originally announced September 2022.
-
Near Optimal Reconstruction of Spherical Harmonic Expansions
Authors:
Amir Zandieh,
Insu Han,
Haim Avron
Abstract:
We propose an algorithm for robust recovery of the spherical harmonic expansion of functions defined on the d-dimensional unit sphere $\mathbb{S}^{d-1}$ using a near-optimal number of function evaluations. We show that for any $f \in L^2(\mathbb{S}^{d-1})$, the number of evaluations of $f$ needed to recover its degree-$q$ spherical harmonic expansion equals the dimension of the space of spherical harmonics of degree at most $q$ up to a logarithmic factor. Moreover, we develop a simple yet efficient algorithm to recover degree-$q$ expansion of $f$ by only evaluating the function on uniformly sampled points on $\mathbb{S}^{d-1}$. Our algorithm is based on the connections between spherical harmonics and Gegenbauer polynomials and leverage score sampling methods. Unlike the prior results on fast spherical harmonic transform, our proposed algorithm works efficiently using a nearly optimal number of samples in any dimension d. We further illustrate the empirical performance of our algorithm on numerical examples.
Submitted 25 February, 2022;
originally announced February 2022.
-
Leverage Score Sampling for Tensor Product Matrices in Input Sparsity Time
Authors:
David P. Woodruff,
Amir Zandieh
Abstract:
We propose an input sparsity time sampling algorithm that can spectrally approximate the Gram matrix corresponding to the $q$-fold column-wise tensor product of $q$ matrices using a nearly optimal number of samples, improving upon all previously known methods by poly$(q)$ factors. Furthermore, for the important special case of the $q$-fold self-tensoring of a dataset, which is the feature matrix of the degree-$q$ polynomial kernel, the leading term of our method's runtime is proportional to the size of the input dataset and has no dependence on $q$. Previous techniques either incur poly$(q)$ slowdowns in their runtime or remove the dependence on $q$ at the expense of having sub-optimal target dimension, and depend quadratically on the number of data-points in their runtime. Our sampling technique relies on a collection of $q$ partially correlated random projections which can be simultaneously applied to a dataset $X$ in total time that only depends on the size of $X$, and at the same time their $q$-fold Kronecker product acts as a near-isometry for any fixed vector in the column span of $X^{\otimes q}$. We also show that our sampling methods generalize to other classes of kernels beyond polynomial, such as Gaussian and Neural Tangent kernels.
Submitted 24 June, 2022; v1 submitted 9 February, 2022;
originally announced February 2022.
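The abstract above relies on the fact that the $q$-fold self-tensoring of a point is an explicit feature map for the degree-$q$ polynomial kernel, i.e. $\langle x^{\otimes q}, y^{\otimes q} \rangle = \langle x, y \rangle^q$. The short Python check below illustrates only this identity by forming the tensoring explicitly, which costs $d^q$ entries per point and is precisely what sampling and sketching methods avoid. The helper names and example vectors are hypothetical and do not come from the paper.

```python
import itertools
import math


def dot(u, v):
    """Plain inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def self_tensor(x, q):
    """Flatten the q-fold tensor power of x into a list of len(x)**q products."""
    return [math.prod(combo) for combo in itertools.product(x, repeat=q)]


# Hypothetical example points in R^3 and a degree-3 kernel.
x = [1.0, 2.0, -1.0]
y = [0.5, -1.0, 3.0]
q = 3

explicit = dot(self_tensor(x, q), self_tensor(y, q))  # inner product of d**q features
kernel = dot(x, y) ** q                               # degree-q polynomial kernel
print(explicit, kernel)
```

The explicit feature vector here has $3^3 = 27$ entries; the exponential growth in $q$ is why implicit approximations of this feature space matter.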
-
Random Gegenbauer Features for Scalable Kernel Methods
Authors:
Insu Han,
Amir Zandieh,
Haim Avron
Abstract:
We propose efficient random features for approximating a new and rich class of kernel functions that we refer to as Generalized Zonal Kernels (GZK). Our proposed GZK family generalizes the zonal kernels (i.e., dot-product kernels on the unit sphere) by introducing radial factors in their Gegenbauer series expansion, and includes a wide range of ubiquitous kernel functions such as the entirety of dot-product kernels as well as the Gaussian and the recently introduced Neural Tangent kernels. Interestingly, by exploiting the reproducing property of the Gegenbauer polynomials, we can construct efficient random features for the GZK family based on randomly oriented Gegenbauer kernels. We prove subspace embedding guarantees for our Gegenbauer features, which ensure that our features can be used for approximately solving learning problems such as kernel k-means clustering, kernel ridge regression, etc. Empirical results show that our proposed features outperform recent kernel approximation methods.
Submitted 7 February, 2022;
originally announced February 2022.
-
Traversing the FFT Computation Tree for Dimension-Independent Sparse Fourier Transforms
Authors:
Karl Bringmann,
Michael Kapralov,
Mikhail Makarov,
Vasileios Nakos,
Amir Yagudin,
Amir Zandieh
Abstract:
We consider the well-studied Sparse Fourier transform problem, where one aims to quickly recover an approximately Fourier $k$-sparse vector $\widehat{x} \in \mathbb{C}^{n^d}$ from observing its time domain representation $x$. In the exact $k$-sparse case the best known dimension-independent algorithm runs in near cubic time in $k$ and it is unclear whether a faster algorithm like in low dimensions is possible. Beyond that, all known approaches either suffer from an exponential dependence on the dimension $d$ or can only tolerate a trivial amount of noise. This is in sharp contrast with the classical FFT of Cooley and Tukey, which is stable and completely insensitive to the dimension of the input vector: its runtime is $O(N\log N)$ in any dimension $d$ for $N=n^d$. Our work aims to address the above issues.
First, we provide a translation/reduction of the exactly $k$-sparse FT problem to a concrete tree exploration task which asks to recover $k$ leaves in a full binary tree under certain exploration rules. Subsequently, we provide (a) an almost quadratic in $k$ time algorithm for this task, and (b) evidence that a strongly subquadratic time for Sparse FT via this approach is likely impossible. We achieve the latter by proving a conditional quadratic time lower bound on sparse polynomial multipoint evaluation (the classical non-equispaced sparse FT) which is a core routine in the aforementioned translation. Thus, our results combined can be viewed as an almost complete understanding of this approach, which is the only known approach that yields sublinear time dimension-independent Sparse FT algorithms.
Subsequently, we provide a robustification of our algorithm, yielding a robust cubic time algorithm under bounded $\ell_2$ noise. This requires proving new structural properties of the recently introduced adaptive aliasing filters combined with a variety of new techniques and ideas.
Submitted 22 January, 2023; v1 submitted 15 July, 2021;
originally announced July 2021.
-
Scaling Neural Tangent Kernels via Sketching and Random Features
Authors:
Amir Zandieh,
Insu Han,
Haim Avron,
Neta Shoham,
Chaewon Kim,
Jinwoo Shin
Abstract:
The Neural Tangent Kernel (NTK) characterizes the behavior of infinitely-wide neural networks trained under least squares loss by gradient descent. Recent works also report that NTK regression can outperform finitely-wide neural networks trained on small-scale datasets. However, the computational complexity of kernel methods has limited its use in large-scale learning tasks. To accelerate learning with NTK, we design a near input-sparsity time approximation algorithm for NTK, by sketching the polynomial expansions of arc-cosine kernels: our sketch for the convolutional counterpart of NTK (CNTK) can transform any image using a linear runtime in the number of pixels. Furthermore, we prove a spectral approximation guarantee for the NTK matrix, by combining random features (based on leverage score sampling) of the arc-cosine kernels with a sketching algorithm. We benchmark our methods on various large-scale regression and classification tasks and show that a linear regressor trained on our CNTK features matches the accuracy of exact CNTK on CIFAR-10 dataset while achieving 150x speedup.
Submitted 8 December, 2021; v1 submitted 15 June, 2021;
originally announced June 2021.
-
Learning with Neural Tangent Kernels in Near Input Sparsity Time
Authors:
Amir Zandieh
Abstract:
The Neural Tangent Kernel (NTK) characterizes the behavior of infinitely wide neural nets trained under least squares loss by gradient descent. However, despite its importance, the super-quadratic runtime of kernel methods limits the use of NTK in large-scale learning tasks. To accelerate kernel machines with NTK, we propose a near input sparsity time algorithm that maps the input data to a randomized low-dimensional feature space so that the inner product of the transformed data approximates their NTK evaluation. Our transformation works by sketching the polynomial expansions of arc-cosine kernels. Furthermore, we propose a feature map for approximating the convolutional counterpart of the NTK, which can transform any image using a runtime that is only linear in the number of pixels. We show that in standard large-scale regression and classification tasks a linear regressor trained on our features outperforms trained Neural Nets and Nystrom approximation of NTK kernel.
Submitted 27 July, 2021; v1 submitted 1 April, 2021;
originally announced April 2021.
-
Near Input Sparsity Time Kernel Embeddings via Adaptive Sampling
Authors:
David P. Woodruff,
Amir Zandieh
Abstract:
To accelerate kernel methods, we propose a near input sparsity time algorithm for sampling the high-dimensional feature space implicitly defined by a kernel transformation. Our main contribution is an importance sampling method for subsampling the feature space of a degree $q$ tensoring of data points in almost input sparsity time, improving the recent oblivious sketching method of (Ahle et al., 2020) by a factor of $q^{5/2}/ε^2$. This leads to a subspace embedding for the polynomial kernel, as well as the Gaussian kernel, with a target dimension that is only linearly dependent on the statistical dimension of the kernel and in time which is only linearly dependent on the sparsity of the input dataset. We show how our subspace embedding bounds imply new statistical guarantees for kernel ridge regression. Furthermore, we empirically show that in large-scale regression tasks, our algorithm outperforms state-of-the-art kernel approximation methods.
Submitted 14 July, 2020; v1 submitted 8 July, 2020;
originally announced July 2020.
-
Scaling up Kernel Ridge Regression via Locality Sensitive Hashing
Authors:
Michael Kapralov,
Navid Nouri,
Ilya Razenshteyn,
Ameya Velingker,
Amir Zandieh
Abstract:
Random binning features, introduced in the seminal paper of Rahimi and Recht (2007), are an efficient method for approximating a kernel matrix using locality sensitive hashing. Random binning features provide a very simple and efficient way of approximating the Laplace kernel but unfortunately do not apply to many important classes of kernels, notably ones that generate smooth Gaussian processes, such as the Gaussian kernel and Matern kernel. In this paper, we introduce a simple weighted version of random binning features and show that the corresponding kernel function generates Gaussian processes of any desired smoothness. We show that our weighted random binning features provide a spectral approximation to the corresponding kernel matrix, leading to efficient algorithms for kernel ridge regression. Experiments on large scale regression datasets show that our method outperforms the accuracy of random Fourier features method.
Submitted 21 March, 2020;
originally announced March 2020.
-
Oblivious Sketching of High-Degree Polynomial Kernels
Authors:
Thomas D. Ahle,
Michael Kapralov,
Jakob B. T. Knudsen,
Rasmus Pagh,
Ameya Velingker,
David Woodruff,
Amir Zandieh
Abstract:
Kernel methods are fundamental tools in machine learning that allow detection of non-linear dependencies between data without explicitly constructing feature vectors in high dimensional spaces. A major disadvantage of kernel methods is their poor scalability: primitives such as kernel PCA or kernel ridge regression generally take prohibitively large quadratic space and (at least) quadratic time, as kernel matrices are usually dense. Some methods for speeding up kernel linear algebra are known, but they all invariably take time exponential in either the dimension of the input point set (e.g., fast multipole methods suffer from the curse of dimensionality) or in the degree of the kernel function.
Oblivious sketching has emerged as a powerful approach to speeding up numerical linear algebra over the past decade, but our understanding of oblivious sketching solutions for kernel matrices has remained quite limited, suffering from the aforementioned exponential dependence on input parameters. Our main contribution is a general method for applying sketching solutions developed in numerical linear algebra over the past decade to a tensoring of data points without forming the tensoring explicitly. This leads to the first oblivious sketch for the polynomial kernel with a target dimension that is only polynomially dependent on the degree of the kernel function, as well as the first oblivious sketch for the Gaussian kernel on bounded datasets that does not suffer from an exponential dependence on the dimensionality of input data points.
Submitted 22 December, 2020; v1 submitted 3 September, 2019;
originally announced September 2019.
-
Dimension-independent Sparse Fourier Transform
Authors:
Michael Kapralov,
Ameya Velingker,
Amir Zandieh
Abstract:
The Discrete Fourier Transform (DFT) is a fundamental computational primitive, and the fastest known algorithm for computing the DFT is the FFT (Fast Fourier Transform) algorithm. One remarkable feature of FFT is the fact that its runtime depends only on the size $N$ of the input vector, but not on the dimensionality of the input domain: FFT runs in time $O(N\log N)$ irrespective of whether the DFT in question is on $\mathbb{Z}_N$ or $\mathbb{Z}_n^d$ for some $d>1$, where $N=n^d$.
The state of the art for Sparse FFT, i.e. the problem of computing the DFT of a signal that has at most $k$ nonzeros in Fourier domain, is very different: all current techniques for sublinear time computation of Sparse FFT incur an exponential dependence on the dimension $d$ in the runtime. In this paper we give the first algorithm that computes the DFT of a $k$-sparse signal in time $\text{poly}(k, \log N)$ in any dimension $d$, avoiding the curse of dimensionality inherent in all previously known techniques. Our main tool is a new class of filters that we refer to as adaptive aliasing filters: these filters allow isolating frequencies of a $k$-Fourier sparse signal using $O(k)$ samples in time domain and $O(k\log N)$ runtime per frequency, in any dimension $d$.
We also investigate natural average case models of the input signal: (1) worst case support in Fourier domain with randomized coefficients and (2) random locations in Fourier domain with worst case coefficients. Our techniques lead to an $\widetilde O(k^2)$ time algorithm for the former and an $\widetilde O(k)$ time algorithm for the latter.
Submitted 27 February, 2019;
originally announced February 2019.
-
A Universal Sampling Method for Reconstructing Signals with Simple Fourier Transforms
Authors:
Haim Avron,
Michael Kapralov,
Cameron Musco,
Christopher Musco,
Ameya Velingker,
Amir Zandieh
Abstract:
Reconstructing continuous signals from a small number of discrete samples is a fundamental problem across science and engineering. In practice, we are often interested in signals with 'simple' Fourier structure, such as bandlimited, multiband, and Fourier sparse signals. More broadly, any prior knowledge about a signal's Fourier power spectrum can constrain its complexity. Intuitively, signals with more highly constrained Fourier structure require fewer samples to reconstruct.
We formalize this intuition by showing that, roughly, a continuous signal from a given class can be approximately reconstructed using a number of samples proportional to the *statistical dimension* of the allowed power spectrum of that class. Further, in nearly all settings, this natural measure tightly characterizes the sample complexity of signal reconstruction.
Surprisingly, we also show that, up to logarithmic factors, a universal non-uniform sampling strategy can achieve this optimal complexity for *any class of signals*. We present a simple and efficient algorithm for recovering a signal from the samples taken. For bandlimited and sparse signals, our method matches the state-of-the-art. At the same time, it gives the first computationally and sample efficient solution to a broad range of problems, including multiband signal reconstruction and kriging and Gaussian process regression tasks in one dimension.
Our work is based on a novel connection between randomized linear algebra and signal reconstruction with constrained Fourier structure. We extend tools based on statistical leverage score sampling and column-based matrix reconstruction to the approximation of continuous linear operators that arise in signal reconstruction. We believe that these extensions are of independent interest and serve as a foundation for tackling a broad range of continuous time problems using randomized methods.
Submitted 20 December, 2018;
originally announced December 2018.
-
Beyond $1/2$-Approximation for Submodular Maximization on Massive Data Streams
Authors:
Ashkan Norouzi-Fard,
Jakub Tarnawski,
Slobodan Mitrović,
Amir Zandieh,
Aida Mousavifar,
Ola Svensson
Abstract:
Many tasks in machine learning and data mining, such as data diversification, non-parametric learning, kernel machines, clustering, etc., require extracting a small but representative summary from a massive dataset. Often, such problems can be posed as maximizing a submodular set function subject to a cardinality constraint. We consider this question in the streaming setting, where elements arrive over time at a fast pace and thus we need to design an efficient, low-memory algorithm. One such method, proposed by Badanidiyuru et al. (2014), always finds a $0.5$-approximate solution. Can this approximation factor be improved? We answer this question affirmatively by designing a new algorithm, SALSA, for streaming submodular maximization. It is the first low-memory, single-pass algorithm that improves on the factor $0.5$, under the natural assumption that elements arrive in a random order. We also show that this assumption is necessary, i.e., that there is no such algorithm with a better than $0.5$-approximation when elements arrive in arbitrary order. Our experiments demonstrate that SALSA significantly outperforms the state of the art in applications related to exemplar-based clustering, social graph analysis, and recommender systems.
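To make the streaming setting concrete, here is a minimal sketch of single-pass thresholding for a monotone submodular objective under a cardinality constraint, in the style of the $0.5$-approximation baseline mentioned in the abstract. It is not SALSA. The coverage objective, the function names, and the assumption that a guess `opt_guess` of the optimal value is supplied are all illustrative choices, not taken from the paper.

```python
def coverage(sets, chosen):
    """Monotone submodular objective: number of distinct items covered."""
    covered = set()
    for idx in chosen:
        covered |= sets[idx]
    return len(covered)


def threshold_stream(sets, order, k, opt_guess):
    """Single pass over `order`, keeping at most k elements.

    An element is kept when its marginal gain is at least
    (opt_guess / 2 - f(S)) / (k - |S|), where S is the current selection.
    `opt_guess` is an assumed estimate of the optimal value (hypothetical input).
    """
    chosen = []
    value = 0
    for idx in order:
        if len(chosen) == k:
            break  # budget exhausted; later elements are ignored
        gain = coverage(sets, chosen + [idx]) - value
        threshold = (opt_guess / 2.0 - value) / (k - len(chosen))
        if gain >= threshold:
            chosen.append(idx)
            value += gain
    return chosen, value
```

For the toy instance `sets = [{1, 2}, {2, 3}, {3, 4, 5, 6}, {1}, {7, 8}]` with `k = 2`, the best pair covers 6 items. Streaming in index order with `opt_guess = 6` keeps the first two sets and covers 3 items, which is exactly half of the optimum and shows how an unfavorable arrival order can hold such a rule to the $0.5$ factor.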
Submitted 6 August, 2018;
originally announced August 2018.
-
Random Fourier Features for Kernel Ridge Regression: Approximation Bounds and Statistical Guarantees
Authors:
Haim Avron,
Michael Kapralov,
Cameron Musco,
Christopher Musco,
Ameya Velingker,
Amir Zandieh
Abstract:
Random Fourier features is one of the most popular techniques for scaling up kernel methods, such as kernel ridge regression. However, despite impressive empirical results, the statistical properties of random Fourier features are still not well understood. In this paper we take steps toward filling this gap. Specifically, we approach random Fourier features from a spectral matrix approximation point of view, give tight bounds on the number of Fourier features required to achieve a spectral approximation, and show how spectral matrix approximation bounds imply statistical guarantees for kernel ridge regression.
Qualitatively, our results are twofold: on the one hand, we show that random Fourier feature approximation can provably speed up kernel ridge regression under reasonable assumptions. At the same time, we show that the method is suboptimal, and sampling from a modified distribution in Fourier space, given by the leverage function of the kernel, yields provably better performance. We study this optimal sampling distribution for the Gaussian kernel, achieving a nearly complete characterization for the case of low-dimensional bounded datasets. Based on this characterization, we propose an efficient sampling scheme with guarantees superior to random Fourier features in this regime.
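As an illustration of the baseline being analyzed, the following is a minimal sketch of classical random Fourier features for the Gaussian kernel, with frequencies drawn from the kernel's Gaussian spectral density. It does not implement the leverage-based sampling distribution proposed in the paper. The function names, the bandwidth parameter `sigma`, and the fixed seed are assumptions made for the example.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Exact Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def sample_rff_params(dim, num_features, sigma=1.0, seed=0):
    """Draw frequencies w ~ N(0, I / sigma^2) and phases b ~ Uniform[0, 2*pi]."""
    rng = random.Random(seed)
    freqs = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
             for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    return freqs, phases


def rff_map(x, freqs, phases):
    """Feature map z(x) with entries sqrt(2 / D) * cos(<w, x> + b)."""
    scale = math.sqrt(2.0 / len(freqs))
    return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(freqs, phases)]


def approx_kernel(x, y, freqs, phases):
    """Inner product <z(x), z(y)>, an unbiased estimate of k(x, y)."""
    zx = rff_map(x, freqs, phases)
    zy = rff_map(y, freqs, phases)
    return sum(a * b for a, b in zip(zx, zy))
```

With `D` features, the estimation error for a fixed pair of points typically shrinks like $1/\sqrt{D}$, so a few thousand features already track the exact kernel closely on a pair of nearby low-dimensional points.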
Submitted 21 May, 2018; v1 submitted 26 April, 2018;
originally announced April 2018.
-
An Adaptive Sublinear-Time Block Sparse Fourier Transform
Authors:
Volkan Cevher,
Michael Kapralov,
Jonathan Scarlett,
Amir Zandieh
Abstract:
The problem of approximately computing the $k$ dominant Fourier coefficients of a vector $X$ quickly, and using few samples in time domain, is known as the Sparse Fourier Transform (sparse FFT) problem. A long line of work on the sparse FFT has resulted in algorithms with $O(k\log n\log (n/k))$ runtime [Hassanieh et al., STOC'12] and $O(k\log n)$ sample complexity [Indyk et al., FOCS'14]. These results are proved using non-adaptive algorithms, and the latter $O(k\log n)$ sample complexity result is essentially the best possible under the sparsity assumption alone.
This paper revisits the sparse FFT problem with the added twist that the sparse coefficients approximately obey a $(k_0,k_1)$-block sparse model. In this model, signal frequencies are clustered in $k_0$ intervals with width $k_1$ in Fourier space, where $k= k_0k_1$ is the total sparsity. Signals arising in applications are often well approximated by this model with $k_0\ll k$.
Our main result is the first sparse FFT algorithm for $(k_0, k_1)$-block sparse signals with a sample complexity of $O^*(k_0k_1 + k_0\log(1+ k_0)\log n)$ at constant signal-to-noise ratios, and sublinear runtime. A similar sample complexity was previously achieved in work on model-based compressive sensing using random Gaussian measurements, but required $Ω(n)$ runtime. To the best of our knowledge, our result is the first sublinear-time algorithm for model-based compressed sensing, and the first sparse FFT result that goes below the $O(k\log n)$ sample complexity bound.
Our algorithm crucially uses {\em adaptivity} to achieve the improved sample complexity bound, and we prove that adaptivity is in fact necessary if Fourier measurements are used: Any non-adaptive algorithm must use $Ω(k_0k_1\log \frac{n}{k_0k_1})$ samples for the $(k_0,k_1)$-block sparse model, ruling out improvements over the vanilla sparsity assumption.
Submitted 11 April, 2017; v1 submitted 4 February, 2017;
originally announced February 2017.
-
Reconstruction of Sub-Nyquist Random Sampling for Sparse and Multi-Band Signals
Authors:
Amir Zandieh,
Alireza Zareian,
Masoumeh Azghani,
Farokh Marvasti
Abstract:
As technology advances, various applications require processing signals at higher frequencies. Conventional analog-to-digital converters face implementation challenges when digitizing such signals because of the higher sampling rates involved. Hence, lower (i.e., sub-Nyquist) sampling rates are considered cost efficient. A well-known approach is to consider sparse signals, which have few nonzero frequency components relative to the highest frequency component. When the sparse positions are known in advance, well-established methods already exist. However, there are applications where such information is not available. For such cases, a number of approaches have recently been proposed. In this paper, we propose several random sampling recovery algorithms which do not require any anti-aliasing filter. Moreover, we give conditions under which these recovery techniques converge to the signal. Finally, we confirm the performance of these methods through extensive simulations.
Submitted 26 November, 2014; v1 submitted 8 November, 2014;
originally announced November 2014.