Showing 1–20 of 20 results for author: Bhojanapalli, S

Searching in archive stat. Search in all archives.
  1. arXiv:2305.07810  [pdf, ps, other]

    cs.LG stat.ML

    Depth Dependence of $μ$P Learning Rates in ReLU MLPs

    Authors: Samy Jelassi, Boris Hanin, Ziwei Ji, Sashank J. Reddi, Srinadh Bhojanapalli, Sanjiv Kumar

    Abstract: In this short note we consider random fully connected ReLU networks of width $n$ and depth $L$ equipped with a mean-field weight initialization. Our purpose is to study the dependence on $n$ and $L$ of the maximal update ($μ$P) learning rate, the largest learning rate for which the mean squared change in pre-activations after one step of gradient descent remains uniformly bounded at large $n,L$. A…

    Submitted 12 May, 2023; originally announced May 2023.

  2. arXiv:2301.12923  [pdf, other]

    cs.LG cs.AI stat.ML

    On student-teacher deviations in distillation: does it pay to disobey?

    Authors: Vaishnavh Nagarajan, Aditya Krishna Menon, Srinadh Bhojanapalli, Hossein Mobahi, Sanjiv Kumar

    Abstract: Knowledge distillation (KD) has been widely used to improve the test accuracy of a "student" network, by training it to mimic the soft probabilities of a trained "teacher" network. Yet, it has been shown in recent work that, despite being trained to fit the teacher's probabilities, the student may not only significantly deviate from the teacher probabilities, but may also outdo the teacher in…

    Submitted 18 March, 2024; v1 submitted 30 January, 2023; originally announced January 2023.

  3. arXiv:2210.06313  [pdf, other]

    cs.LG cs.CL cs.CV stat.ML

    The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers

    Authors: Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J. Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, Sanjiv Kumar

    Abstract: This paper studies the curious phenomenon for machine learning models with Transformer architectures that their activation maps are sparse. By activation map we refer to the intermediate output of the multi-layer perceptrons (MLPs) after a ReLU activation function, and by sparse we mean that on average very few entries (e.g., 3.0% for T5-Base and 6.3% for ViT-B16) are nonzero for each input to MLP…

    Submitted 9 June, 2023; v1 submitted 12 October, 2022; originally announced October 2022.

    Comments: A short version was presented at ICLR 2023. Previous title: Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers

  4. arXiv:2202.00980  [pdf, other]

    cs.LG stat.ML

    Robust Training of Neural Networks Using Scale Invariant Architectures

    Authors: Zhiyuan Li, Srinadh Bhojanapalli, Manzil Zaheer, Sashank J. Reddi, Sanjiv Kumar

    Abstract: In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks, especially large language models. However, the use of adaptivity not only comes at the cost of extra memory but also raises the fundamental question: can non-adaptive methods like SGD enjoy similar benefits? In this paper, we provide an affirmative answer to this question by proposing to achieve…

    Submitted 18 July, 2022; v1 submitted 2 February, 2022; originally announced February 2022.

    Comments: 36 pages, 7 figures; ICML 2022

  5. arXiv:2006.04862  [pdf, other]

    cs.LG stat.ML

    $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers

    Authors: Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

    Abstract: Recently, Transformer networks have redefined the state of the art in many NLP tasks. However, these models suffer from quadratic computational cost in the input sequence length $n$ to compute pairwise attention in each layer. This has prompted recent research into sparse Transformers that sparsify the connections in the attention layers. While empirically promising for long sequences, fundamental…

    Submitted 19 December, 2020; v1 submitted 8 June, 2020; originally announced June 2020.

    Comments: 31 pages, NeurIPS 2020 Camera-ready

  6. arXiv:2003.02819  [pdf, other]

    cs.LG stat.ML

    Does label smoothing mitigate label noise?

    Authors: Michal Lukasik, Srinadh Bhojanapalli, Aditya Krishna Menon, Sanjiv Kumar

    Abstract: Label smoothing is commonly used in training deep learning models, wherein one-hot training labels are mixed with uniform label vectors. Empirically, smoothing has been shown to improve both predictive performance and model calibration. In this paper, we study whether label smoothing is also effective as a means of coping with label noise. While label smoothing apparently amplifies this problem --…

    Submitted 5 March, 2020; originally announced March 2020.

  7. arXiv:2002.07028  [pdf, other]

    cs.LG stat.ML

    Low-Rank Bottleneck in Multi-head Attention Models

    Authors: Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

    Abstract: Attention based Transformer architecture has enabled significant advances in the field of natural language processing. In addition to new pre-training techniques, recent improvements crucially rely on working with a relatively larger embedding dimension for tokens. Unfortunately, this leads to models that are prohibitively large to be employed in the downstream tasks. In this paper we identify one…

    Submitted 17 February, 2020; originally announced February 2020.

    Comments: 17 pages, 4 figures

  8. arXiv:1912.10077  [pdf, other]

    cs.LG stat.ML

    Are Transformers universal approximators of sequence-to-sequence functions?

    Authors: Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

    Abstract: Despite the widespread adoption of Transformer models for NLP tasks, the expressive power of these models is not well-understood. In this paper, we establish that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of shared parameters in these models. Furthermore, using…

    Submitted 24 February, 2020; v1 submitted 20 December, 2019; originally announced December 2019.

    Comments: 23 pages, ICLR 2020 camera-ready version

  9. arXiv:1904.00962  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

    Authors: Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, Cho-Jui Hsieh

    Abstract: Training large deep neural networks on massive datasets is computationally very challenging. There has been a recent surge of interest in using large batch stochastic optimization methods to tackle this issue. The most prominent algorithm in this line of research is LARS, which by employing layerwise adaptive learning rates trains ResNet on ImageNet in a few minutes. However, LARS performs poorly fo…

    Submitted 3 January, 2020; v1 submitted 1 April, 2019; originally announced April 2019.

    Comments: Published as a conference paper at ICLR 2020

  10. arXiv:1805.12076  [pdf, other]

    cs.LG stat.ML

    Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks

    Authors: Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, Nathan Srebro

    Abstract: Despite existing work on ensuring generalization of neural networks in terms of scale sensitive complexity measures, such as norms, margin and sharpness, these complexity measures do not offer an explanation of why neural networks generalize better with over-parametrization. In this work we suggest a novel complexity measure based on unit-wise capacities resulting in a tighter generalization bound…

    Submitted 30 May, 2018; originally announced May 2018.

    Comments: 19 pages, 8 figures

  11. arXiv:1803.00186  [pdf, ps, other]

    stat.ML cs.LG math.OC

    Smoothed analysis for low-rank solutions to semidefinite programs in quadratic penalty form

    Authors: Srinadh Bhojanapalli, Nicolas Boumal, Prateek Jain, Praneeth Netrapalli

    Abstract: Semidefinite programs (SDP) are important in learning and combinatorial optimization with numerous applications. In pursuit of low-rank solutions and low complexity algorithms, we consider the Burer--Monteiro factorization approach for solving SDPs. We show that all approximate local optima are global optima for the penalty formulation of appropriately rank-constrained SDPs as long as the number o…

    Submitted 28 February, 2018; originally announced March 2018.

    Comments: 24 pages

  12. arXiv:1705.09280  [pdf, other]

    stat.ML cs.LG

    Implicit Regularization in Matrix Factorization

    Authors: Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, Nathan Srebro

    Abstract: We study implicit regularization when optimizing an underdetermined quadratic objective over a matrix $X$ with gradient descent on a factorization of $X$. We conjecture and provide empirical and theoretical evidence that with small enough step sizes and initialization close enough to the origin, gradient descent on a full dimensional factorization converges to the minimum nuclear norm solution.

    Submitted 25 May, 2017; originally announced May 2017.

  13. arXiv:1610.06656  [pdf, ps, other]

    stat.ML cs.DS cs.IT cs.LG

    Single Pass PCA of Matrix Products

    Authors: Shanshan Wu, Srinadh Bhojanapalli, Sujay Sanghavi, Alexandros G. Dimakis

    Abstract: In this paper we present a new algorithm for computing a low rank approximation of the product $A^TB$ by taking only a single pass of the two matrices $A$ and $B$. The straightforward way to do this is to (a) first sketch $A$ and $B$ individually, and then (b) find the top components using PCA on the sketch. Our algorithm in contrast retains additional summary information about $A,B$ (e.g. row and…

    Submitted 26 October, 2016; v1 submitted 20 October, 2016; originally announced October 2016.

    Comments: 24 pages, 4 figures, NIPS 2016

  14. arXiv:1606.01316  [pdf, other]

    stat.ML cs.DS cs.IT math.NA math.OC

    Provable Burer-Monteiro factorization for a class of norm-constrained matrix problems

    Authors: Dohyung Park, Anastasios Kyrillidis, Srinadh Bhojanapalli, Constantine Caramanis, Sujay Sanghavi

    Abstract: We study the projected gradient descent method on low-rank matrix problems with a strongly convex objective. We use the Burer-Monteiro factorization approach to implicitly enforce low-rankness; such factorization introduces non-convexity in the objective. We focus on constraint sets that include both positive semi-definite (PSD) constraints and specific matrix norm-constraints. Such criteria appea…

    Submitted 1 October, 2016; v1 submitted 3 June, 2016; originally announced June 2016.

    Comments: 28 pages

  15. arXiv:1605.07221  [pdf, other]

    stat.ML cs.LG math.OC

    Global Optimality of Local Search for Low Rank Matrix Recovery

    Authors: Srinadh Bhojanapalli, Behnam Neyshabur, Nathan Srebro

    Abstract: We show that there are no spurious local minima in the non-convex factorized parametrization of low-rank matrix recovery from incoherent linear measurements. With noisy measurements we show all local minima are very close to a global optimum. Together with a curvature bound at saddle points, this yields a polynomial time global convergence guarantee for stochastic gradient descent from random…

    Submitted 26 May, 2016; v1 submitted 23 May, 2016; originally announced May 2016.

    Comments: 21 pages, 3 figures

  16. arXiv:1509.03917  [pdf, other]

    stat.ML cs.DS cs.IT cs.LG math.NA math.OC

    Dropping Convexity for Faster Semi-definite Optimization

    Authors: Srinadh Bhojanapalli, Anastasios Kyrillidis, Sujay Sanghavi

    Abstract: We study the minimization of a convex function $f(X)$ over the set of $n\times n$ positive semi-definite matrices, but when the problem is recast as $\min_U g(U) := f(UU^\top)$, with $U \in \mathbb{R}^{n \times r}$ and $r \leq n$. We study the performance of gradient descent on $g$---which we refer to as Factored Gradient Descent (FGD)---under standard assumptions on the original function $f$. W…

    Submitted 15 April, 2016; v1 submitted 13 September, 2015; originally announced September 2015.

    Comments: 40 pages

  17. arXiv:1502.05023  [pdf, ps, other]

    stat.ML cs.DS cs.IT cs.LG

    A New Sampling Technique for Tensors

    Authors: Srinadh Bhojanapalli, Sujay Sanghavi

    Abstract: In this paper we propose new techniques to sample arbitrary third-order tensors, with an objective of speeding up tensor algorithms that have recently gained popularity in machine learning. Our main contribution is a new way to select, in a biased random way, only $O(n^{1.5}/ε^2)$ of the possible $n^3$ elements while still achieving each of the three goals: (a) tensor sparsification: for…

    Submitted 19 February, 2015; v1 submitted 17 February, 2015; originally announced February 2015.

    Comments: 29 pages, 3 figures

  18. arXiv:1410.3886  [pdf, ps, other]

    cs.DS cs.LG stat.ML

    Tighter Low-rank Approximation via Sampling the Leveraged Element

    Authors: Srinadh Bhojanapalli, Prateek Jain, Sujay Sanghavi

    Abstract: In this work, we propose a new randomized algorithm for computing a low-rank approximation to a given matrix. Taking an approach different from existing literature, our method first involves a specific biased sampling, with an element being chosen based on the leverage scores of its row and column, and then involves weighted alternating minimization over the factored form of the intended low-rank…

    Submitted 14 October, 2014; originally announced October 2014.

    Comments: 36 pages, 3 figures, Extended abstract to appear in the proceedings of ACM-SIAM Symposium on Discrete Algorithms (SODA15)

  19. arXiv:1402.2324  [pdf, ps, other]

    stat.ML cs.IT cs.LG

    Universal Matrix Completion

    Authors: Srinadh Bhojanapalli, Prateek Jain

    Abstract: The problem of low-rank matrix completion has recently generated a lot of interest leading to several results that offer exact solutions to the problem. However, in order to do so, these methods make assumptions that can be quite restrictive in practice. More specifically, the methods assume that: a) the observed indices are sampled uniformly at random, and b) for every new matrix, the observed in…

    Submitted 11 July, 2014; v1 submitted 10 February, 2014; originally announced February 2014.

    Comments: 22 pages, 2 figures

  20. arXiv:1306.2979  [pdf, other]

    stat.ML cs.IT cs.LG

    Completing Any Low-rank Matrix, Provably

    Authors: Yudong Chen, Srinadh Bhojanapalli, Sujay Sanghavi, Rachel Ward

    Abstract: Matrix completion, i.e., the exact and provable recovery of a low-rank matrix from a small subset of its elements, is currently only known to be possible if the matrix satisfies a restrictive structural constraint---known as incoherence---on its row and column spaces. In these cases, the subset of elements is sampled uniformly at random. In this paper, we show that any rank-$r$…

    Submitted 21 July, 2014; v1 submitted 12 June, 2013; originally announced June 2013.

    Comments: Added a new necessary condition (Theorem 6) and a result on completion of row coherent matrices (Corollary 4). Partial results appeared in the International Conference on Machine Learning 2014, under the title 'Coherent Matrix Completion'. (34 pages, 4 figures)