-
Depth Dependence of $μ$P Learning Rates in ReLU MLPs
Authors:
Samy Jelassi,
Boris Hanin,
Ziwei Ji,
Sashank J. Reddi,
Srinadh Bhojanapalli,
Sanjiv Kumar
Abstract:
In this short note we consider random fully connected ReLU networks of width $n$ and depth $L$ equipped with a mean-field weight initialization. Our purpose is to study the dependence on $n$ and $L$ of the maximal update ($μ$P) learning rate, the largest learning rate for which the mean squared change in pre-activations after one step of gradient descent remains uniformly bounded at large $n,L$. A…
▽ More
In this short note we consider random fully connected ReLU networks of width $n$ and depth $L$ equipped with a mean-field weight initialization. Our purpose is to study the dependence on $n$ and $L$ of the maximal update ($μ$P) learning rate, the largest learning rate for which the mean squared change in pre-activations after one step of gradient descent remains uniformly bounded at large $n,L$. As in prior work on $μ$P of Yang et. al., we find that this maximal update learning rate is independent of $n$ for all but the first and last layer weights. However, we find that it has a non-trivial dependence of $L$, scaling like $L^{-3/2}.$
△ Less
Submitted 12 May, 2023;
originally announced May 2023.
-
On student-teacher deviations in distillation: does it pay to disobey?
Authors:
Vaishnavh Nagarajan,
Aditya Krishna Menon,
Srinadh Bhojanapalli,
Hossein Mobahi,
Sanjiv Kumar
Abstract:
Knowledge distillation (KD) has been widely used to improve the test accuracy of a "student" network, by training it to mimic the soft probabilities of a trained "teacher" network. Yet, it has been shown in recent work that, despite being trained to fit the teacher's probabilities, the student may not only significantly deviate from the teacher probabilities, but may also outdo than the teacher in…
▽ More
Knowledge distillation (KD) has been widely used to improve the test accuracy of a "student" network, by training it to mimic the soft probabilities of a trained "teacher" network. Yet, it has been shown in recent work that, despite being trained to fit the teacher's probabilities, the student may not only significantly deviate from the teacher probabilities, but may also outdo than the teacher in performance. Our work aims to reconcile this seemingly paradoxical observation. Specifically, we characterize the precise nature of the student-teacher deviations, and argue how they can co-occur with better generalization. First, through experiments on image and language data, we identify that these probability deviations correspond to the student systematically exaggerating the confidence levels of the teacher. Next, we theoretically and empirically establish another form of exaggeration in some simple settings: KD exaggerates the implicit bias of gradient descent in converging faster along the top eigendirections of the data. Finally, we tie these two observations together: we demonstrate that the exaggerated bias of KD can simultaneously result in both (a) the exaggeration of confidence and (b) the improved generalization of the student, thus offering a resolution to the apparent paradox. Our analysis brings existing theory and practice closer by considering the role of gradient descent in KD and by demonstrating the exaggerated bias effect in both theoretical and empirical settings.
△ Less
Submitted 18 March, 2024; v1 submitted 30 January, 2023;
originally announced January 2023.
-
The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers
Authors:
Zonglin Li,
Chong You,
Srinadh Bhojanapalli,
Daliang Li,
Ankit Singh Rawat,
Sashank J. Reddi,
Ke Ye,
Felix Chern,
Felix Yu,
Ruiqi Guo,
Sanjiv Kumar
Abstract:
This paper studies the curious phenomenon for machine learning models with Transformer architectures that their activation maps are sparse. By activation map we refer to the intermediate output of the multi-layer perceptrons (MLPs) after a ReLU activation function, and by sparse we mean that on average very few entries (e.g., 3.0% for T5-Base and 6.3% for ViT-B16) are nonzero for each input to MLP…
▽ More
This paper studies the curious phenomenon for machine learning models with Transformer architectures that their activation maps are sparse. By activation map we refer to the intermediate output of the multi-layer perceptrons (MLPs) after a ReLU activation function, and by sparse we mean that on average very few entries (e.g., 3.0% for T5-Base and 6.3% for ViT-B16) are nonzero for each input to MLP. Moreover, larger Transformers with more layers and wider MLP hidden dimensions are sparser as measured by the percentage of nonzero entries. Through extensive experiments we demonstrate that the emergence of sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks, on both training and evaluation data, for Transformers of various configurations, at layers of all depth levels, as well as for other architectures including MLP-mixers and 2-layer MLPs. We show that sparsity also emerges using training datasets with random labels, or with random inputs, or with infinite amount of data, demonstrating that sparsity is not a result of a specific family of datasets. We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers. Moreover, we demonstrate perhaps surprisingly that enforcing an even sparser activation via Top-k thresholding with a small value of k brings a collection of desired but missing properties for Transformers, namely less sensitivity to noisy training data, more robustness to input corruptions, and better calibration for their prediction confidence.
△ Less
Submitted 9 June, 2023; v1 submitted 12 October, 2022;
originally announced October 2022.
-
Robust Training of Neural Networks Using Scale Invariant Architectures
Authors:
Zhiyuan Li,
Srinadh Bhojanapalli,
Manzil Zaheer,
Sashank J. Reddi,
Sanjiv Kumar
Abstract:
In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks, especially large language models. However, the use of adaptivity not only comes at the cost of extra memory but also raises the fundamental question: can non-adaptive methods like SGD enjoy similar benefits? In this paper, we provide an affirmative answer to this question by proposing to achieve…
▽ More
In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks, especially large language models. However, the use of adaptivity not only comes at the cost of extra memory but also raises the fundamental question: can non-adaptive methods like SGD enjoy similar benefits? In this paper, we provide an affirmative answer to this question by proposing to achieve both robust and memory-efficient training via the following general recipe: (1) modify the architecture and make it scale invariant, i.e. the scale of parameter doesn't affect the output of the network, (2) train with SGD and weight decay, and optionally (3) clip the global gradient norm proportional to weight norm multiplied by $\sqrt{\tfrac{2λ}η}$, where $η$ is learning rate and $λ$ is weight decay. We show that this general approach is robust to rescaling of parameter and loss by proving that its convergence only depends logarithmically on the scale of initialization and loss, whereas the standard SGD might not even converge for many initializations. Following our recipe, we design a scale invariant version of BERT, called SIBERT, which when trained simply by vanilla SGD achieves performance comparable to BERT trained by adaptive methods like Adam on downstream tasks.
△ Less
Submitted 18 July, 2022; v1 submitted 2 February, 2022;
originally announced February 2022.
-
$O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers
Authors:
Chulhee Yun,
Yin-Wen Chang,
Srinadh Bhojanapalli,
Ankit Singh Rawat,
Sashank J. Reddi,
Sanjiv Kumar
Abstract:
Recently, Transformer networks have redefined the state of the art in many NLP tasks. However, these models suffer from quadratic computational cost in the input sequence length $n$ to compute pairwise attention in each layer. This has prompted recent research into sparse Transformers that sparsify the connections in the attention layers. While empirically promising for long sequences, fundamental…
▽ More
Recently, Transformer networks have redefined the state of the art in many NLP tasks. However, these models suffer from quadratic computational cost in the input sequence length $n$ to compute pairwise attention in each layer. This has prompted recent research into sparse Transformers that sparsify the connections in the attention layers. While empirically promising for long sequences, fundamental questions remain unanswered: Can sparse Transformers approximate any arbitrary sequence-to-sequence function, similar to their dense counterparts? How does the sparsity pattern and the sparsity level affect their performance? In this paper, we address these questions and provide a unifying framework that captures existing sparse attention models. We propose sufficient conditions under which we prove that a sparse attention model can universally approximate any sequence-to-sequence function. Surprisingly, our results show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections. Lastly, we present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
△ Less
Submitted 19 December, 2020; v1 submitted 8 June, 2020;
originally announced June 2020.
-
Does label smoothing mitigate label noise?
Authors:
Michal Lukasik,
Srinadh Bhojanapalli,
Aditya Krishna Menon,
Sanjiv Kumar
Abstract:
Label smoothing is commonly used in training deep learning models, wherein one-hot training labels are mixed with uniform label vectors. Empirically, smoothing has been shown to improve both predictive performance and model calibration. In this paper, we study whether label smoothing is also effective as a means of co** with label noise. While label smoothing apparently amplifies this problem --…
▽ More
Label smoothing is commonly used in training deep learning models, wherein one-hot training labels are mixed with uniform label vectors. Empirically, smoothing has been shown to improve both predictive performance and model calibration. In this paper, we study whether label smoothing is also effective as a means of co** with label noise. While label smoothing apparently amplifies this problem --- being equivalent to injecting symmetric noise to the labels --- we show how it relates to a general family of loss-correction techniques from the label noise literature. Building on this connection, we show that label smoothing is competitive with loss-correction under label noise. Further, we show that when distilling models from noisy data, label smoothing of the teacher is beneficial; this is in contrast to recent findings for noise-free problems, and sheds further light on settings where label smoothing is beneficial.
△ Less
Submitted 5 March, 2020;
originally announced March 2020.
-
Low-Rank Bottleneck in Multi-head Attention Models
Authors:
Srinadh Bhojanapalli,
Chulhee Yun,
Ankit Singh Rawat,
Sashank J. Reddi,
Sanjiv Kumar
Abstract:
Attention based Transformer architecture has enabled significant advances in the field of natural language processing. In addition to new pre-training techniques, recent improvements crucially rely on working with a relatively larger embedding dimension for tokens. Unfortunately, this leads to models that are prohibitively large to be employed in the downstream tasks. In this paper we identify one…
▽ More
Attention based Transformer architecture has enabled significant advances in the field of natural language processing. In addition to new pre-training techniques, recent improvements crucially rely on working with a relatively larger embedding dimension for tokens. Unfortunately, this leads to models that are prohibitively large to be employed in the downstream tasks. In this paper we identify one of the important factors contributing to the large embedding size requirement. In particular, our analysis highlights that the scaling between the number of heads and the size of each head in the current architecture gives rise to a low-rank bottleneck in attention heads, causing this limitation. We further validate this in our experiments. As a solution we propose to set the head size of an attention unit to input sequence length, and independent of the number of heads, resulting in multi-head attention layers with provably more expressive power. We empirically show that this allows us to train models with a relatively smaller embedding dimension and with better performance scaling.
△ Less
Submitted 17 February, 2020;
originally announced February 2020.
-
Are Transformers universal approximators of sequence-to-sequence functions?
Authors:
Chulhee Yun,
Srinadh Bhojanapalli,
Ankit Singh Rawat,
Sashank J. Reddi,
Sanjiv Kumar
Abstract:
Despite the widespread adoption of Transformer models for NLP tasks, the expressive power of these models is not well-understood. In this paper, we establish that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of shared parameters in these models. Furthermore, using…
▽ More
Despite the widespread adoption of Transformer models for NLP tasks, the expressive power of these models is not well-understood. In this paper, we establish that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of shared parameters in these models. Furthermore, using positional encodings, we circumvent the restriction of permutation equivariance, and show that Transformer models can universally approximate arbitrary continuous sequence-to-sequence functions on a compact domain. Interestingly, our proof techniques clearly highlight the different roles of the self-attention and the feed-forward layers in Transformers. In particular, we prove that fixed width self-attention layers can compute contextual map**s of the input sequences, playing a key role in the universal approximation property of Transformers. Based on this insight from our analysis, we consider other simpler alternatives to self-attention layers and empirically evaluate them.
△ Less
Submitted 24 February, 2020; v1 submitted 20 December, 2019;
originally announced December 2019.
-
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
Authors:
Yang You,
**g Li,
Sashank Reddi,
Jonathan Hseu,
Sanjiv Kumar,
Srinadh Bhojanapalli,
Xiaodan Song,
James Demmel,
Kurt Keutzer,
Cho-Jui Hsieh
Abstract:
Training large deep neural networks on massive datasets is computationally very challenging. There has been recent surge in interest in using large batch stochastic optimization methods to tackle this issue. The most prominent algorithm in this line of research is LARS, which by employing layerwise adaptive learning rates trains ResNet on ImageNet in a few minutes. However, LARS performs poorly fo…
▽ More
Training large deep neural networks on massive datasets is computationally very challenging. There has been recent surge in interest in using large batch stochastic optimization methods to tackle this issue. The most prominent algorithm in this line of research is LARS, which by employing layerwise adaptive learning rates trains ResNet on ImageNet in a few minutes. However, LARS performs poorly for attention models like BERT, indicating that its performance gains are not consistent across tasks. In this paper, we first study a principled layerwise adaptation strategy to accelerate training of deep neural networks using large mini-batches. Using this strategy, we develop a new layerwise adaptive large batch optimization technique called LAMB; we then provide convergence analysis of LAMB as well as LARS, showing convergence to a stationary point in general nonconvex settings. Our empirical results demonstrate the superior performance of LAMB across various tasks such as BERT and ResNet-50 training with very little hyperparameter tuning. In particular, for BERT training, our optimizer enables use of very large batch sizes of 32868 without any degradation of performance. By increasing the batch size to the memory limit of a TPUv3 Pod, BERT training time can be reduced from 3 days to just 76 minutes (Table 1). The LAMB implementation is available at https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/lamb.py
△ Less
Submitted 3 January, 2020; v1 submitted 1 April, 2019;
originally announced April 2019.
-
Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks
Authors:
Behnam Neyshabur,
Zhiyuan Li,
Srinadh Bhojanapalli,
Yann LeCun,
Nathan Srebro
Abstract:
Despite existing work on ensuring generalization of neural networks in terms of scale sensitive complexity measures, such as norms, margin and sharpness, these complexity measures do not offer an explanation of why neural networks generalize better with over-parametrization. In this work we suggest a novel complexity measure based on unit-wise capacities resulting in a tighter generalization bound…
▽ More
Despite existing work on ensuring generalization of neural networks in terms of scale sensitive complexity measures, such as norms, margin and sharpness, these complexity measures do not offer an explanation of why neural networks generalize better with over-parametrization. In this work we suggest a novel complexity measure based on unit-wise capacities resulting in a tighter generalization bound for two layer ReLU networks. Our capacity bound correlates with the behavior of test error with increasing network sizes, and could potentially explain the improvement in generalization with over-parametrization. We further present a matching lower bound for the Rademacher complexity that improves over previous capacity lower bounds for neural networks.
△ Less
Submitted 30 May, 2018;
originally announced May 2018.
-
Smoothed analysis for low-rank solutions to semidefinite programs in quadratic penalty form
Authors:
Srinadh Bhojanapalli,
Nicolas Boumal,
Prateek Jain,
Praneeth Netrapalli
Abstract:
Semidefinite programs (SDP) are important in learning and combinatorial optimization with numerous applications. In pursuit of low-rank solutions and low complexity algorithms, we consider the Burer--Monteiro factorization approach for solving SDPs. We show that all approximate local optima are global optima for the penalty formulation of appropriately rank-constrained SDPs as long as the number o…
▽ More
Semidefinite programs (SDP) are important in learning and combinatorial optimization with numerous applications. In pursuit of low-rank solutions and low complexity algorithms, we consider the Burer--Monteiro factorization approach for solving SDPs. We show that all approximate local optima are global optima for the penalty formulation of appropriately rank-constrained SDPs as long as the number of constraints scales sub-quadratically with the desired rank of the optimal solution. Our result is based on a simple penalty function formulation of the rank-constrained SDP along with a smoothed analysis to avoid worst-case cost matrices. We particularize our results to two applications, namely, Max-Cut and matrix completion.
△ Less
Submitted 28 February, 2018;
originally announced March 2018.
-
Implicit Regularization in Matrix Factorization
Authors:
Suriya Gunasekar,
Blake Woodworth,
Srinadh Bhojanapalli,
Behnam Neyshabur,
Nathan Srebro
Abstract:
We study implicit regularization when optimizing an underdetermined quadratic objective over a matrix $X$ with gradient descent on a factorization of $X$. We conjecture and provide empirical and theoretical evidence that with small enough step sizes and initialization close enough to the origin, gradient descent on a full dimensional factorization converges to the minimum nuclear norm solution.
We study implicit regularization when optimizing an underdetermined quadratic objective over a matrix $X$ with gradient descent on a factorization of $X$. We conjecture and provide empirical and theoretical evidence that with small enough step sizes and initialization close enough to the origin, gradient descent on a full dimensional factorization converges to the minimum nuclear norm solution.
△ Less
Submitted 25 May, 2017;
originally announced May 2017.
-
Single Pass PCA of Matrix Products
Authors:
Shanshan Wu,
Srinadh Bhojanapalli,
Sujay Sanghavi,
Alexandros G. Dimakis
Abstract:
In this paper we present a new algorithm for computing a low rank approximation of the product $A^TB$ by taking only a single pass of the two matrices $A$ and $B$. The straightforward way to do this is to (a) first sketch $A$ and $B$ individually, and then (b) find the top components using PCA on the sketch. Our algorithm in contrast retains additional summary information about $A,B$ (e.g. row and…
▽ More
In this paper we present a new algorithm for computing a low rank approximation of the product $A^TB$ by taking only a single pass of the two matrices $A$ and $B$. The straightforward way to do this is to (a) first sketch $A$ and $B$ individually, and then (b) find the top components using PCA on the sketch. Our algorithm in contrast retains additional summary information about $A,B$ (e.g. row and column norms etc.) and uses this additional information to obtain an improved approximation from the sketches. Our main analytical result establishes a comparable spectral norm guarantee to existing two-pass methods; in addition we also provide results from an Apache Spark implementation that shows better computational and statistical performance on real-world and synthetic evaluation datasets.
△ Less
Submitted 26 October, 2016; v1 submitted 20 October, 2016;
originally announced October 2016.
-
Provable Burer-Monteiro factorization for a class of norm-constrained matrix problems
Authors:
Dohyung Park,
Anastasios Kyrillidis,
Srinadh Bhojanapalli,
Constantine Caramanis,
Sujay Sanghavi
Abstract:
We study the projected gradient descent method on low-rank matrix problems with a strongly convex objective. We use the Burer-Monteiro factorization approach to implicitly enforce low-rankness; such factorization introduces non-convexity in the objective. We focus on constraint sets that include both positive semi-definite (PSD) constraints and specific matrix norm-constraints. Such criteria appea…
▽ More
We study the projected gradient descent method on low-rank matrix problems with a strongly convex objective. We use the Burer-Monteiro factorization approach to implicitly enforce low-rankness; such factorization introduces non-convexity in the objective. We focus on constraint sets that include both positive semi-definite (PSD) constraints and specific matrix norm-constraints. Such criteria appear in quantum state tomography and phase retrieval applications.
We show that non-convex projected gradient descent favors local linear convergence in the factored space. We build our theory on a novel descent lemma, that non-trivially extends recent results on the unconstrained problem. The resulting algorithm is Projected Factored Gradient Descent, abbreviated as ProjFGD, and shows superior performance compared to state of the art on quantum state tomography and sparse phase retrieval applications.
△ Less
Submitted 1 October, 2016; v1 submitted 3 June, 2016;
originally announced June 2016.
-
Global Optimality of Local Search for Low Rank Matrix Recovery
Authors:
Srinadh Bhojanapalli,
Behnam Neyshabur,
Nathan Srebro
Abstract:
We show that there are no spurious local minima in the non-convex factorized parametrization of low-rank matrix recovery from incoherent linear measurements. With noisy measurements we show all local minima are very close to a global optimum. Together with a curvature bound at saddle points, this yields a polynomial time global convergence guarantee for stochastic gradient descent {\em from random…
▽ More
We show that there are no spurious local minima in the non-convex factorized parametrization of low-rank matrix recovery from incoherent linear measurements. With noisy measurements we show all local minima are very close to a global optimum. Together with a curvature bound at saddle points, this yields a polynomial time global convergence guarantee for stochastic gradient descent {\em from random initialization}.
△ Less
Submitted 26 May, 2016; v1 submitted 23 May, 2016;
originally announced May 2016.
-
Drop** Convexity for Faster Semi-definite Optimization
Authors:
Srinadh Bhojanapalli,
Anastasios Kyrillidis,
Sujay Sanghavi
Abstract:
We study the minimization of a convex function $f(X)$ over the set of $n\times n$ positive semi-definite matrices, but when the problem is recast as $\min_U g(U) := f(UU^\top)$, with $U \in \mathbb{R}^{n \times r}$ and $r \leq n$. We study the performance of gradient descent on $g$---which we refer to as Factored Gradient Descent (FGD)---under standard assumptions on the original function $f$.
W…
▽ More
We study the minimization of a convex function $f(X)$ over the set of $n\times n$ positive semi-definite matrices, but when the problem is recast as $\min_U g(U) := f(UU^\top)$, with $U \in \mathbb{R}^{n \times r}$ and $r \leq n$. We study the performance of gradient descent on $g$---which we refer to as Factored Gradient Descent (FGD)---under standard assumptions on the original function $f$.
We provide a rule for selecting the step size and, with this choice, show that the local convergence rate of FGD mirrors that of standard gradient descent on the original $f$: i.e., after $k$ steps, the error is $O(1/k)$ for smooth $f$, and exponentially small in $k$ when $f$ is (restricted) strongly convex. In addition, we provide a procedure to initialize FGD for (restricted) strongly convex objectives and when one only has access to $f$ via a first-order oracle; for several problem instances, such proper initialization leads to global convergence guarantees.
FGD and similar procedures are widely used in practice for problems that can be posed as matrix factorization. To the best of our knowledge, this is the first paper to provide precise convergence rate guarantees for general convex functions under standard convex assumptions.
△ Less
Submitted 15 April, 2016; v1 submitted 13 September, 2015;
originally announced September 2015.
-
A New Sampling Technique for Tensors
Authors:
Srinadh Bhojanapalli,
Sujay Sanghavi
Abstract:
In this paper we propose new techniques to sample arbitrary third-order tensors, with an objective of speeding up tensor algorithms that have recently gained popularity in machine learning. Our main contribution is a new way to select, in a biased random way, only $O(n^{1.5}/ε^2)$ of the possible $n^3$ elements while still achieving each of the three goals: \\ {\em (a) tensor sparsification}: for…
▽ More
In this paper we propose new techniques to sample arbitrary third-order tensors, with an objective of speeding up tensor algorithms that have recently gained popularity in machine learning. Our main contribution is a new way to select, in a biased random way, only $O(n^{1.5}/ε^2)$ of the possible $n^3$ elements while still achieving each of the three goals: \\ {\em (a) tensor sparsification}: for a tensor that has to be formed from arbitrary samples, compute very few elements to get a good spectral approximation, and for arbitrary orthogonal tensors {\em (b) tensor completion:} recover an exactly low-rank tensor from a small number of samples via alternating least squares, or {\em (c) tensor factorization:} approximating factors of a low-rank tensor corrupted by noise. \\ Our sampling can be used along with existing tensor-based algorithms to speed them up, removing the computational bottleneck in these methods.
△ Less
Submitted 19 February, 2015; v1 submitted 17 February, 2015;
originally announced February 2015.
-
Tighter Low-rank Approximation via Sampling the Leveraged Element
Authors:
Srinadh Bhojanapalli,
Prateek Jain,
Sujay Sanghavi
Abstract:
In this work, we propose a new randomized algorithm for computing a low-rank approximation to a given matrix. Taking an approach different from existing literature, our method first involves a specific biased sampling, with an element being chosen based on the leverage scores of its row and column, and then involves weighted alternating minimization over the factored form of the intended low-rank…
▽ More
In this work, we propose a new randomized algorithm for computing a low-rank approximation to a given matrix. Taking an approach different from existing literature, our method first involves a specific biased sampling, with an element being chosen based on the leverage scores of its row and column, and then involves weighted alternating minimization over the factored form of the intended low-rank matrix, to minimize error only on these samples. Our method can leverage input sparsity, yet produce approximations in {\em spectral} (as opposed to the weaker Frobenius) norm; this combines the best aspects of otherwise disparate current results, but with a dependence on the condition number $κ= σ_1/σ_r$. In particular we require $O(nnz(M) + \frac{nκ^2 r^5}{ε^2})$ computations to generate a rank-$r$ approximation to $M$ in spectral norm. In contrast, the best existing method requires $O(nnz(M)+ \frac{nr^2}{ε^4})$ time to compute an approximation in Frobenius norm. Besides the tightness in spectral norm, we have a better dependence on the error $ε$. Our method is naturally and highly parallelizable.
Our new approach enables two extensions that are interesting on their own. The first is a new method to directly compute a low-rank approximation (in efficient factored form) to the product of two given matrices; it computes a small random set of entries of the product, and then executes weighted alternating minimization (as before) on these. The sampling strategy is different because now we cannot access leverage scores of the product matrix (but instead have to work with input matrices). The second extension is an improved algorithm with smaller communication complexity for the distributed PCA setting (where each server has small set of rows of the matrix, and want to compute low rank approximation with small amount of communication with other servers).
△ Less
Submitted 14 October, 2014;
originally announced October 2014.
-
Universal Matrix Completion
Authors:
Srinadh Bhojanapalli,
Prateek Jain
Abstract:
The problem of low-rank matrix completion has recently generated a lot of interest leading to several results that offer exact solutions to the problem. However, in order to do so, these methods make assumptions that can be quite restrictive in practice. More specifically, the methods assume that: a) the observed indices are sampled uniformly at random, and b) for every new matrix, the observed in…
▽ More
The problem of low-rank matrix completion has recently generated a lot of interest leading to several results that offer exact solutions to the problem. However, in order to do so, these methods make assumptions that can be quite restrictive in practice. More specifically, the methods assume that: a) the observed indices are sampled uniformly at random, and b) for every new matrix, the observed indices are sampled afresh. In this work, we address these issues by providing a universal recovery guarantee for matrix completion that works for a variety of sampling schemes. In particular, we show that if the set of sampled indices come from the edges of a bipartite graph with large spectral gap (i.e. gap between the first and the second singular value), then the nuclear norm minimization based method exactly recovers all low-rank matrices that satisfy certain incoherence properties. Moreover, we also show that under certain stricter incoherence conditions, $O(nr^2)$ uniformly sampled entries are enough to recover any rank-$r$ $n\times n$ matrix, in contrast to the $O(nr\log n)$ sample complexity required by other matrix completion algorithms as well as existing analyses of the nuclear norm method.
△ Less
Submitted 11 July, 2014; v1 submitted 10 February, 2014;
originally announced February 2014.
-
Completing Any Low-rank Matrix, Provably
Authors:
Yudong Chen,
Srinadh Bhojanapalli,
Sujay Sanghavi,
Rachel Ward
Abstract:
Matrix completion, i.e., the exact and provable recovery of a low-rank matrix from a small subset of its elements, is currently only known to be possible if the matrix satisfies a restrictive structural constraint---known as {\em incoherence}---on its row and column spaces. In these cases, the subset of elements is sampled uniformly at random.
In this paper, we show that {\em any} rank-$ r $…
▽ More
Matrix completion, i.e., the exact and provable recovery of a low-rank matrix from a small subset of its elements, is currently only known to be possible if the matrix satisfies a restrictive structural constraint---known as {\em incoherence}---on its row and column spaces. In these cases, the subset of elements is sampled uniformly at random.
In this paper, we show that {\em any} rank-$ r $ $ n$-by-$ n $ matrix can be exactly recovered from as few as $O(nr \log^2 n)$ randomly chosen elements, provided this random choice is made according to a {\em specific biased distribution}: the probability of any element being sampled should be proportional to the sum of the leverage scores of the corresponding row, and column. Perhaps equally important, we show that this specific form of sampling is nearly necessary, in a natural precise sense; this implies that other perhaps more intuitive sampling schemes fail.
We further establish three ways to use the above result for the setting when leverage scores are not known \textit{a priori}: (a) a sampling strategy for the case when only one of the row or column spaces are incoherent, (b) a two-phase sampling procedure for general matrices that first samples to estimate leverage scores followed by sampling for exact recovery, and (c) an analysis showing the advantages of weighted nuclear/trace-norm minimization over the vanilla un-weighted formulation for the case of non-uniform sampling.
△ Less
Submitted 21 July, 2014; v1 submitted 12 June, 2013;
originally announced June 2013.