Showing 1–46 of 46 results for author: Reddi, S

  1. arXiv:2406.17968  [pdf, other]

    cs.IR cs.AI cs.LG stat.ML

    Efficient Document Ranking with Learnable Late Interactions

    Authors: Ziwei Ji, Himanshu Jain, Andreas Veit, Sashank J. Reddi, Sadeep Jayasumana, Ankit Singh Rawat, Aditya Krishna Menon, Felix Yu, Sanjiv Kumar

    Abstract: Cross-Encoder (CE) and Dual-Encoder (DE) models are two fundamental approaches for query-document relevance in information retrieval. To predict relevance, CE models use joint query-document embeddings, while DE models maintain factorized query and document embeddings; usually, the former has higher quality while the latter benefits from lower latency. Recently, late-interaction models have been p…

    Submitted 25 June, 2024; originally announced June 2024.

  2. arXiv:2406.02469  [pdf, other]

    cs.LG cs.CL

    Landscape-Aware Growing: The Power of a Little LAG

    Authors: Stefani Karp, Nikunj Saunshi, Sobhan Miryoosefi, Sashank J. Reddi, Sanjiv Kumar

    Abstract: Recently, there has been increasing interest in efficient pretraining paradigms for training Transformer-based models. Several recent approaches use smaller models to initialize larger models in order to save computation (e.g., stacking and fusion). In this work, we study the fundamental question of how to select the best growing strategy from a given pool of growing strategies. Prior works have e…

    Submitted 4 June, 2024; originally announced June 2024.

  3. arXiv:2402.05913  [pdf, other]

    cs.CL cs.LG

    Efficient Stagewise Pretraining via Progressive Subnetworks

    Authors: Abhishek Panigrahi, Nikunj Saunshi, Kaifeng Lyu, Sobhan Miryoosefi, Sashank Reddi, Satyen Kale, Sanjiv Kumar

    Abstract: Recent developments in large language models have sparked interest in efficient pretraining methods. A recent effective paradigm is to perform stage-wise training, where the size of the model is gradually increased over the course of training (e.g. gradual stacking (Reddi et al., 2023)). While the resource and wall-time savings are appealing, it has limitations, particularly the inability to evalu…

    Submitted 8 February, 2024; originally announced February 2024.

  4. arXiv:2306.13239  [pdf, other]

    cs.LG

    The Inductive Bias of Flatness Regularization for Deep Matrix Factorization

    Authors: Khashayar Gatmiry, Zhiyuan Li, Ching-Yao Chuang, Sashank Reddi, Tengyu Ma, Stefanie Jegelka

    Abstract: Recent works on over-parameterized neural networks have shown that the stochasticity in optimizers has the implicit regularization effect of minimizing the sharpness of the loss function (in particular, the trace of its Hessian) over the family of zero-loss solutions. More explicit forms of flatness regularization also empirically improve the generalization performance. However, it remains unclear wh…

    Submitted 22 June, 2023; originally announced June 2023.

  5. arXiv:2305.07810  [pdf, ps, other]

    cs.LG stat.ML

    Depth Dependence of $μ$P Learning Rates in ReLU MLPs

    Authors: Samy Jelassi, Boris Hanin, Ziwei Ji, Sashank J. Reddi, Srinadh Bhojanapalli, Sanjiv Kumar

    Abstract: In this short note we consider random fully connected ReLU networks of width $n$ and depth $L$ equipped with a mean-field weight initialization. Our purpose is to study the dependence on $n$ and $L$ of the maximal update ($μ$P) learning rate, the largest learning rate for which the mean squared change in pre-activations after one step of gradient descent remains uniformly bounded at large $n,L$. A…

    Submitted 12 May, 2023; originally announced May 2023.

  6. arXiv:2212.00309  [pdf, other]

    cs.LG cs.CR

    Differentially Private Adaptive Optimization with Delayed Preconditioners

    Authors: Tian Li, Manzil Zaheer, Ken Ziyu Liu, Sashank J. Reddi, H. Brendan McMahan, Virginia Smith

    Abstract: Privacy noise may negate the benefits of using adaptive optimizers in differentially private model training. Prior works typically address this issue by using auxiliary information (e.g., public data) to boost the effectiveness of adaptive optimization. In this work, we explore techniques to estimate and efficiently adapt to gradient geometry in private adaptive optimization without auxiliary data…

    Submitted 7 June, 2023; v1 submitted 1 December, 2022; originally announced December 2022.

    Comments: Accepted by ICLR 2023

  7. arXiv:2211.03970  [pdf, other]

    cs.LG math.OC

    On the Algorithmic Stability and Generalization of Adaptive Optimization Methods

    Authors: Han Nguyen, Hai Pham, Sashank J. Reddi, Barnabás Póczos

    Abstract: Despite their popularity in deep learning and machine learning in general, the theoretical properties of adaptive optimizers such as Adagrad, RMSProp, Adam or AdamW are not yet fully understood. In this paper, we develop a novel framework to study the stability and generalization of these optimization methods. Based on this framework, we show provable guarantees about such properties that depend h…

    Submitted 7 November, 2022; originally announced November 2022.

    Comments: 21 pages including appendix

  8. arXiv:2210.06313  [pdf, other]

    cs.LG cs.CL cs.CV stat.ML

    The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers

    Authors: Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J. Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, Sanjiv Kumar

    Abstract: This paper studies the curious phenomenon for machine learning models with Transformer architectures that their activation maps are sparse. By activation map we refer to the intermediate output of the multi-layer perceptrons (MLPs) after a ReLU activation function, and by sparse we mean that on average very few entries (e.g., 3.0% for T5-Base and 6.3% for ViT-B16) are nonzero for each input to MLP…

    Submitted 9 June, 2023; v1 submitted 12 October, 2022; originally announced October 2022.

    Comments: A short version was presented at ICLR 2023. Previous title: Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers

  9. arXiv:2202.05963  [pdf, other]

    cs.LG cs.CR stat.ML

    Private Adaptive Optimization with Side Information

    Authors: Tian Li, Manzil Zaheer, Sashank J. Reddi, Virginia Smith

    Abstract: Adaptive optimization methods have become the default solvers for many machine learning tasks. Unfortunately, the benefits of adaptivity may degrade when training with differential privacy, as the noise added to ensure privacy reduces the effectiveness of the adaptive preconditioner. To this end, we propose AdaDPS, a general framework that uses non-sensitive side information to precondition the gr…

    Submitted 24 June, 2022; v1 submitted 11 February, 2022; originally announced February 2022.

    Comments: ICML 2022

  10. arXiv:2202.00980  [pdf, other]

    cs.LG stat.ML

    Robust Training of Neural Networks Using Scale Invariant Architectures

    Authors: Zhiyuan Li, Srinadh Bhojanapalli, Manzil Zaheer, Sashank J. Reddi, Sanjiv Kumar

    Abstract: In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks, especially large language models. However, the use of adaptivity not only comes at the cost of extra memory but also raises the fundamental question: can non-adaptive methods like SGD enjoy similar benefits? In this paper, we provide an affirmative answer to this question by proposing to achieve…

    Submitted 18 July, 2022; v1 submitted 2 February, 2022; originally announced February 2022.

    Comments: 36 pages, 7 figures; ICML 2022

  11. arXiv:2201.11865  [pdf, other]

    cs.LG cs.DC

    FedLite: A Scalable Approach for Federated Learning on Resource-constrained Clients

    Authors: Jianyu Wang, Hang Qi, Ankit Singh Rawat, Sashank Reddi, Sagar Waghmare, Felix X. Yu, Gauri Joshi

    Abstract: In classical federated learning, the clients contribute to the overall training by communicating local updates for the underlying model on their private data to a coordinating server. However, updating and communicating the entire model becomes prohibitively expensive when resource-constrained clients collectively aim to train a large machine learning model. Split learning provides a natural solut…

    Submitted 16 February, 2022; v1 submitted 27 January, 2022; originally announced January 2022.

  12. arXiv:2107.06917  [pdf, other]

    cs.LG

    A Field Guide to Federated Optimization

    Authors: Jianyu Wang, Zachary Charles, Zheng Xu, Gauri Joshi, H. Brendan McMahan, Blaise Aguera y Arcas, Maruan Al-Shedivat, Galen Andrew, Salman Avestimehr, Katharine Daly, Deepesh Data, Suhas Diggavi, Hubert Eichner, Advait Gadhikar, Zachary Garrett, Antonious M. Girgis, Filip Hanzely, Andrew Hard, Chaoyang He, Samuel Horvath, Zhouyuan Huo, Alex Ingerman, Martin Jaggi, Tara Javidi, Peter Kairouz, et al. (28 additional authors not shown)

    Abstract: Federated learning and analytics are a distributed approach for collaboratively learning models (or statistics) from decentralized data, motivated by and designed for privacy protection. The distributed learning process can be formulated as solving federated optimization problems, which emphasize communication efficiency, data heterogeneity, compatibility with privacy and system requirements, and…

    Submitted 14 July, 2021; originally announced July 2021.

  13. arXiv:2105.05736  [pdf, other]

    cs.LG stat.ML

    Disentangling Sampling and Labeling Bias for Learning in Large-Output Spaces

    Authors: Ankit Singh Rawat, Aditya Krishna Menon, Wittawat Jitkrittum, Sadeep Jayasumana, Felix X. Yu, Sashank Reddi, Sanjiv Kumar

    Abstract: Negative sampling schemes enable efficient training given a large number of classes, by offering a means to approximate a computationally expensive loss function that takes all labels into account. In this paper, we present a new connection between these schemes and loss modification techniques for countering label imbalance. We show that different negative sampling schemes implicitly trade-off pe…

    Submitted 12 May, 2021; originally announced May 2021.

    Comments: To appear in ICML 2021

  14. arXiv:2102.06849  [pdf, other]

    cs.LG cs.AI stat.ML

    Distilling Double Descent

    Authors: Andrew Cotter, Aditya Krishna Menon, Harikrishna Narasimhan, Ankit Singh Rawat, Sashank J. Reddi, Yichen Zhou

    Abstract: Distillation is the technique of training a "student" model based on examples that are labeled by a separate "teacher" model, which itself is trained on a labeled dataset. The most common explanations for why distillation "works" are predicated on the assumption that the student is provided with soft labels, e.g., probabilities or confidences, from the teacher model. In this work, we show, that,…

    Submitted 12 February, 2021; originally announced February 2021.

  15. arXiv:2011.08474  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    Federated Composite Optimization

    Authors: Honglin Yuan, Manzil Zaheer, Sashank Reddi

    Abstract: Federated Learning (FL) is a distributed learning paradigm that scales on-device learning collaboratively and privately. Standard FL algorithms such as FedAvg are primarily geared towards smooth unconstrained settings. In this paper, we study the Federated Composite Optimization (FCO) problem, in which the loss function contains a non-smooth regularizer. Such problems arise naturally in FL applica…

    Submitted 5 June, 2021; v1 submitted 17 November, 2020; originally announced November 2020.

    Comments: Accepted to ICML 2021. Code repository see https://github.com/hongliny/FCO-ICML21

  16. arXiv:2008.03606  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    Mime: Mimicking Centralized Stochastic Algorithms in Federated Learning

    Authors: Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, Ananda Theertha Suresh

    Abstract: Federated learning (FL) is a challenging setting for optimization due to the heterogeneity of the data across different clients which gives rise to the client drift phenomenon. In fact, obtaining an algorithm for FL which is uniformly better than simple centralized training has been a major open problem thus far. In this work, we propose a general algorithmic framework, Mime, which i) mitigates cl…

    Submitted 8 June, 2021; v1 submitted 8 August, 2020; originally announced August 2020.

    Comments: Version 2 provides stronger theoretical results and more thorough experiments

    MSC Class: 68W40; 68W15; 90C25; 90C06 ACM Class: G.1.6; F.2.1; E.4

  17. arXiv:2006.04862  [pdf, other]

    cs.LG stat.ML

    $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers

    Authors: Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

    Abstract: Recently, Transformer networks have redefined the state of the art in many NLP tasks. However, these models suffer from quadratic computational cost in the input sequence length $n$ to compute pairwise attention in each layer. This has prompted recent research into sparse Transformers that sparsify the connections in the attention layers. While empirically promising for long sequences, fundamental…

    Submitted 19 December, 2020; v1 submitted 8 June, 2020; originally announced June 2020.

    Comments: 31 pages, NeurIPS 2020 Camera-ready

  18. arXiv:2005.10419  [pdf, other]

    cs.LG stat.ML

    Why distillation helps: a statistical perspective

    Authors: Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, Seungyeon Kim, Sanjiv Kumar

    Abstract: Knowledge distillation is a technique for improving the performance of a simple "student" model by replacing its one-hot training labels with a distribution over labels obtained from a complex "teacher" model. While this simple approach has proven widely effective, a basic question remains unresolved: why does distillation help? In this paper, we present a statistical perspective on distillation w…

    Submitted 20 May, 2020; originally announced May 2020.

  19. arXiv:2004.10915  [pdf, other]

    cs.LG stat.ML

    Doubly-stochastic mining for heterogeneous retrieval

    Authors: Ankit Singh Rawat, Aditya Krishna Menon, Andreas Veit, Felix Yu, Sashank J. Reddi, Sanjiv Kumar

    Abstract: Modern retrieval problems are characterised by training sets with potentially billions of labels, and heterogeneous data distributions across subpopulations (e.g., users of a retrieval system may be from different countries), each of which poses a challenge. The first challenge concerns scalability: with a large number of labels, standard losses are difficult to optimise even on a single example.…

    Submitted 22 April, 2020; originally announced April 2020.

  20. arXiv:2003.00295  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    Adaptive Federated Optimization

    Authors: Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, H. Brendan McMahan

    Abstract: Federated learning is a distributed machine learning paradigm in which a large number of clients coordinate with a central server to learn a model without sharing their own training data. Standard federated optimization methods such as Federated Averaging (FedAvg) are often difficult to tune and exhibit unfavorable convergence behavior. In non-federated settings, adaptive optimization methods have…

    Submitted 8 September, 2021; v1 submitted 29 February, 2020; originally announced March 2020.

    Comments: Published as a conference paper at ICLR 2021

  21. arXiv:2002.08528  [pdf, other]

    cs.LG math.OC stat.ML

    Adaptive Sampling Distributed Stochastic Variance Reduced Gradient for Heterogeneous Distributed Datasets

    Authors: Ilqar Ramazanli, Han Nguyen, Hai Pham, Sashank J. Reddi, Barnabas Poczos

    Abstract: We study distributed optimization algorithms for minimizing the average of heterogeneous functions distributed across several machines with a focus on communication efficiency. In such settings, naively using the classical stochastic gradient descent (SGD) or its variants (e.g., SVRG) with a uniform sampling of machines typically yields poor performance. It often leads to the dependence of…

    Submitted 17 November, 2020; v1 submitted 19 February, 2020; originally announced February 2020.

  22. arXiv:2002.07028  [pdf, other]

    cs.LG stat.ML

    Low-Rank Bottleneck in Multi-head Attention Models

    Authors: Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

    Abstract: Attention based Transformer architecture has enabled significant advances in the field of natural language processing. In addition to new pre-training techniques, recent improvements crucially rely on working with a relatively larger embedding dimension for tokens. Unfortunately, this leads to models that are prohibitively large to be employed in the downstream tasks. In this paper we identify one…

    Submitted 17 February, 2020; originally announced February 2020.

    Comments: 17 pages, 4 figures

  23. arXiv:1912.10077  [pdf, other]

    cs.LG stat.ML

    Are Transformers universal approximators of sequence-to-sequence functions?

    Authors: Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

    Abstract: Despite the widespread adoption of Transformer models for NLP tasks, the expressive power of these models is not well-understood. In this paper, we establish that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of shared parameters in these models. Furthermore, using…

    Submitted 24 February, 2020; v1 submitted 20 December, 2019; originally announced December 2019.

    Comments: 23 pages, ICLR 2020 camera-ready version

  24. arXiv:1912.03194  [pdf, other]

    math.OC cs.LG

    Why are Adaptive Methods Good for Attention Models?

    Authors: Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank J Reddi, Sanjiv Kumar, Suvrit Sra

    Abstract: While stochastic gradient descent (SGD) is still the de facto algorithm in deep learning, adaptive methods like Clipped SGD/Adam have been observed to outperform SGD across important tasks, such as attention models. The settings under which SGD performs poorly in comparison to adaptive methods are not well understood yet. In this paper, we provide empirical and theoretical evidence that a h…

    Submitted 23 October, 2020; v1 submitted 6 December, 2019; originally announced December 2019.

  25. arXiv:1910.09464  [pdf, other]

    cs.LG stat.ML

    Learning to Learn by Zeroth-Order Oracle

    Authors: Yangjun Ruan, Yuanhao Xiong, Sashank Reddi, Sanjiv Kumar, Cho-Jui Hsieh

    Abstract: In the learning to learn (L2L) framework, we cast the design of optimization algorithms as a machine learning problem and use deep neural networks to learn the update rules. In this paper, we extend the L2L framework to zeroth-order (ZO) optimization setting, where no explicit gradient information is available. Our learned optimizer, modeled as a recurrent neural network (RNN), first approximates gr…

    Submitted 7 February, 2020; v1 submitted 21 October, 2019; originally announced October 2019.

    Comments: Published as a conference paper at ICLR 2020

  26. arXiv:1910.06378  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    SCAFFOLD: Stochastic Controlled Averaging for Federated Learning

    Authors: Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, Ananda Theertha Suresh

    Abstract: Federated Averaging (FedAvg) has emerged as the algorithm of choice for federated learning due to its simplicity and low communication cost. However, in spite of recent research efforts, its performance is not fully understood. We obtain tight convergence rates for FedAvg and prove that it suffers from 'client-drift' when the data is heterogeneous (non-iid), resulting in unstable and slow converge…

    Submitted 9 April, 2021; v1 submitted 14 October, 2019; originally announced October 2019.

    Comments: v2 contains analysis of FedAvg, non-convex rates of Scaffold, and experimental evaluation. v3 fixes typos, ICML version. v4 slightly improves rate of SCAFFOLD for general convex functions

    MSC Class: 68W40; 68W15; 90C25; 90C06 ACM Class: G.1.6; F.2.1; E.4

  27. arXiv:1908.07643  [pdf, other]

    cs.LG cs.CR stat.ML

    AdaCliP: Adaptive Clipping for Private SGD

    Authors: Venkatadheeraj Pichapati, Ananda Theertha Suresh, Felix X. Yu, Sashank J. Reddi, Sanjiv Kumar

    Abstract: Privacy preserving machine learning algorithms are crucial for learning models over user data to protect sensitive information. Motivated by this, differentially private stochastic gradient descent (SGD) algorithms for training machine learning models have been proposed. At each step, these algorithms modify the gradients and add noise proportional to the sensitivity of the modified gradients. Und…

    Submitted 23 October, 2019; v1 submitted 20 August, 2019; originally announced August 2019.

  28. arXiv:1904.09237  [pdf, other]

    cs.LG math.OC stat.ML

    On the Convergence of Adam and Beyond

    Authors: Sashank J. Reddi, Satyen Kale, Sanjiv Kumar

    Abstract: Several recently proposed stochastic optimization methods that have been successfully used in training deep networks such as RMSProp, Adam, Adadelta, Nadam are based on using gradient updates scaled by square roots of exponential moving averages of squared past gradients. In many applications, e.g. learning with large output spaces, it has been empirically observed that these algorithms fail to co…

    Submitted 19 April, 2019; originally announced April 2019.

    Comments: Appeared in ICLR 2018

  29. arXiv:1904.00962  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

    Authors: Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, Cho-Jui Hsieh

    Abstract: Training large deep neural networks on massive datasets is computationally very challenging. There has been a recent surge in interest in using large batch stochastic optimization methods to tackle this issue. The most prominent algorithm in this line of research is LARS, which by employing layerwise adaptive learning rates trains ResNet on ImageNet in a few minutes. However, LARS performs poorly fo…

    Submitted 3 January, 2020; v1 submitted 1 April, 2019; originally announced April 2019.

    Comments: Published as a conference paper at ICLR 2020

  30. arXiv:1901.09149  [pdf, other]

    cs.LG math.OC stat.ML

    Escaping Saddle Points with Adaptive Gradient Methods

    Authors: Matthew Staib, Sashank J. Reddi, Satyen Kale, Sanjiv Kumar, Suvrit Sra

    Abstract: Adaptive methods such as Adam and RMSProp are widely used in deep learning but are not well understood. In this paper, we seek a crisp, clean and precise characterization of their behavior in nonconvex settings. To this end, we first provide a novel view of adaptive methods as preconditioned SGD, where the preconditioner is estimated in an online manner. By studying the preconditioner on its own,…

    Submitted 3 February, 2020; v1 submitted 25 January, 2019; originally announced January 2019.

    Comments: Update Theorem 4.1 and proof to use martingale concentration bounds, i.e. matrix Freedman

  31. arXiv:1810.07076  [pdf, ps, other]

    cs.LG stat.ML

    Stochastic Negative Mining for Learning with Large Output Spaces

    Authors: Sashank J. Reddi, Satyen Kale, Felix Yu, Dan Holtmann-Rice, Jiecao Chen, Sanjiv Kumar

    Abstract: We consider the problem of retrieving the most relevant labels for a given input when the size of the output space is very large. Retrieval methods are modeled as set-valued classifiers which output a small set of classes for each input, and a mistake is made if the label is not in the output set. Despite its practical importance, a statistically principled, yet practical solution to this problem…

    Submitted 16 October, 2018; originally announced October 2018.

  32. arXiv:1709.01434  [pdf, other]

    cs.LG cs.AI

    A Generic Approach for Escaping Saddle points

    Authors: Sashank J Reddi, Manzil Zaheer, Suvrit Sra, Barnabas Poczos, Francis Bach, Ruslan Salakhutdinov, Alexander J Smola

    Abstract: A central challenge to using first-order methods for optimizing nonconvex problems is the presence of saddle points. First-order methods often get stuck at saddle points, greatly deteriorating their performance. Typically, to escape from saddles one has to use second-order methods. However, most works on second-order methods rely extensively on expensive Hessian-based computations, making them imp…

    Submitted 5 September, 2017; originally announced September 2017.

  33. arXiv:1608.06879  [pdf, other]

    math.OC cs.LG stat.ML

    AIDE: Fast and Communication Efficient Distributed Optimization

    Authors: Sashank J. Reddi, Jakub Konečný, Peter Richtárik, Barnabás Póczós, Alex Smola

    Abstract: In this paper, we present two new communication-efficient methods for distributed minimization of an average of functions. The first algorithm is an inexact variant of the DANE algorithm that allows any local algorithm to return an approximate solution to a local subproblem. We show that such a strategy does not affect the theoretical guarantees of DANE significantly. In fact, our approach can be…

    Submitted 24 August, 2016; originally announced August 2016.

  34. arXiv:1607.08254  [pdf, other]

    math.OC cs.LG stat.ML

    Stochastic Frank-Wolfe Methods for Nonconvex Optimization

    Authors: Sashank J. Reddi, Suvrit Sra, Barnabas Poczos, Alex Smola

    Abstract: We study Frank-Wolfe methods for nonconvex stochastic and finite-sum optimization problems. Frank-Wolfe methods (in the convex case) have gained tremendous recent interest in machine learning and optimization communities due to their projection-free property and their ability to exploit structured constraints. However, our understanding of these algorithms in the nonconvex setting is fairly limite…

    Submitted 29 July, 2016; v1 submitted 27 July, 2016; originally announced July 2016.

  35. arXiv:1605.07147  [pdf, other]

    math.OC cs.LG

    Riemannian SVRG: Fast Stochastic Optimization on Riemannian Manifolds

    Authors: Hongyi Zhang, Sashank J. Reddi, Suvrit Sra

    Abstract: We study optimization of finite sums of geodesically smooth functions on Riemannian manifolds. Although variance reduction techniques for optimizing finite-sums have witnessed tremendous attention in recent years, existing work is limited to vector space problems. We introduce Riemannian SVRG (RSVRG), a new variance reduced Riemannian optimization method. We analyze RSVRG for both geodesically…

    Submitted 7 April, 2017; v1 submitted 23 May, 2016; originally announced May 2016.

    Comments: This is the final version that appeared in NIPS 2016. Our proof of Lemma 2 was incorrect in the previous arXiv version. (9 pages paper + 6 pages appendix)

    Journal ref: Advances in Neural Information Processing Systems 29 (NIPS 2016)

  36. arXiv:1605.06900  [pdf, other]

    math.OC cs.LG stat.ML

    Fast Stochastic Methods for Nonsmooth Nonconvex Optimization

    Authors: Sashank J. Reddi, Suvrit Sra, Barnabas Poczos, Alex Smola

    Abstract: We analyze stochastic algorithms for optimizing nonconvex, nonsmooth finite-sum problems, where the nonconvex part is smooth and the nonsmooth part is convex. Surprisingly, unlike the smooth case, our knowledge of this fundamental problem is very limited. For example, it is not known whether the proximal stochastic gradient method with constant minibatch converges to a stationary point. To tackle…

    Submitted 23 May, 2016; originally announced May 2016.

  37. arXiv:1603.06160  [pdf, other]

    math.OC cs.LG cs.NE stat.ML

    Stochastic Variance Reduction for Nonconvex Optimization

    Authors: Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, Alex Smola

    Abstract: We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary po…

    Submitted 4 April, 2016; v1 submitted 19 March, 2016; originally announced March 2016.

    Comments: Minor feedback changes

  38. arXiv:1603.06159  [pdf, other]

    math.OC cs.LG stat.ML

    Fast Incremental Method for Nonconvex Optimization

    Authors: Sashank J. Reddi, Suvrit Sra, Barnabas Poczos, Alex Smola

    Abstract: We analyze a fast incremental aggregated gradient method for optimizing nonconvex problems of the form $\min_x \sum_i f_i(x)$. Specifically, we analyze the SAGA algorithm within an Incremental First-order Oracle framework, and show that it converges to a stationary point provably faster than both gradient descent and stochastic gradient descent. We also discuss a Polyak's special class of nonconve…

    Submitted 19 March, 2016; originally announced March 2016.

  39. arXiv:1508.00655  [pdf, other]

    math.ST cs.AI cs.IT cs.LG stat.ML

    Adaptivity and Computation-Statistics Tradeoffs for Kernel and Distance based High Dimensional Two Sample Testing

    Authors: Aaditya Ramdas, Sashank J. Reddi, Barnabas Poczos, Aarti Singh, Larry Wasserman

    Abstract: Nonparametric two sample testing is a decision theoretic problem that involves identifying differences between two random variables without making parametric assumptions about their underlying distributions. We refer to the most common settings as mean difference alternatives (MDA), for testing differences only in first moments, and general difference alternatives (GDA), which is about testing for…

    Submitted 4 August, 2015; originally announced August 2015.

    Comments: 35 pages, 4 figures

  40. arXiv:1506.06840  [pdf, other]

    cs.LG stat.ML

    On Variance Reduction in Stochastic Gradient Descent and its Asynchronous Variants

    Authors: Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, Alex Smola

    Abstract: We study optimization algorithms based on variance reduction for stochastic gradient descent (SGD). Remarkable recent progress has been made in this direction through development of algorithms like SAG, SVRG, SAGA. These algorithms have been shown to outperform SGD, both theoretically and empirically. However, asynchronous versions of these algorithms, a crucial requirement for modern large-scale…

    Submitted 24 January, 2016; v1 submitted 22 June, 2015; originally announced June 2015.

  41. arXiv:1411.6314  [pdf, other]

    math.ST cs.AI cs.IT cs.LG stat.ML

    On the High-dimensional Power of Linear-time Kernel Two-Sample Testing under Mean-difference Alternatives

    Authors: Aaditya Ramdas, Sashank J. Reddi, Barnabas Poczos, Aarti Singh, Larry Wasserman

    Abstract: Nonparametric two sample testing deals with the question of consistently deciding if two distributions are different, given samples from both, without making any parametric assumptions about the form of the distributions. The current literature is split into two kinds of tests - those which are consistent without any assumptions about how the distributions may differ (general alternatives…

    Submitted 23 November, 2014; originally announced November 2014.

    Comments: 25 pages, 5 figures

  42. arXiv:1409.2617  [pdf, other]

    math.OC stat.ML

    Large-scale randomized-coordinate descent methods with non-separable linear constraints

    Authors: Sashank Reddi, Ahmed Hefny, Carlton Downey, Avinava Dubey, Suvrit Sra

    Abstract: We develop randomized (block) coordinate descent (CD) methods for linearly constrained convex optimization. Unlike most CD methods, we do not assume the constraints to be separable, but let them be coupled linearly. To our knowledge, ours is the first CD method that allows linear coupling constraints, without making the global iteration complexity have an exponential dependence on the number of co…

    Submitted 10 June, 2015; v1 submitted 9 September, 2014; originally announced September 2014.

  43. arXiv:1406.2083  [pdf, other]

    stat.ML cs.IT cs.LG math.ST stat.ME

    On the Decreasing Power of Kernel and Distance based Nonparametric Hypothesis Tests in High Dimensions

    Authors: Sashank J. Reddi, Aaditya Ramdas, Barnabás Póczos, Aarti Singh, Larry Wasserman

    Abstract: This paper is about two related decision theoretic problems, nonparametric two-sample testing and independence testing. There is a belief that two recently proposed solutions, based on kernels and distances between pairs of points, behave well in high-dimensional settings. We identify different sources of misconception that give rise to the above belief. Specifically, we differentiate the hardness…

    Submitted 23 November, 2014; v1 submitted 9 June, 2014; originally announced June 2014.

    Comments: 19 pages, 9 figures, published in AAAI-15: The 29th AAAI Conference on Artificial Intelligence (with author order reversed from ArXiv)

  44. arXiv:1210.4882  [pdf]

    cs.AI

    A Maximum Likelihood Approach For Selecting Sets of Alternatives

    Authors: Ariel D. Procaccia, Sashank J. Reddi, Nisarg Shah

    Abstract: We consider the problem of selecting a subset of alternatives given noisy evaluations of the relative strength of different alternatives. We wish to select a k-subset (for a given k) that provides a maximum likelihood estimate for one of several objectives, e.g., containing the strongest alternative. Although this problem is NP-hard, we show that when the noise level is sufficiently high, intuitiv…

    Submitted 16 October, 2012; originally announced October 2012.

    Comments: Appears in Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI2012)

    Report number: UAI-P-2012-PG-695-704

  45. arXiv:1210.4877  [pdf]

    cs.GT cs.MA

    Incentive Decision Processes

    Authors: Sashank J. Reddi, Emma Brunskill

    Abstract: We consider Incentive Decision Processes, where a principal seeks to reduce its costs due to another agent's behavior, by offering incentives to the agent for alternate behavior. We focus on the case where a principal interacts with a greedy agent whose preferences are hidden and static. Though IDPs can be directly modeled as partially observable Markov decision processes (POMDP), we show that it…

    Submitted 16 October, 2012; originally announced October 2012.

    Comments: Appears in Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI2012)

    Report number: UAI-P-2012-PG-418-427

  46. arXiv:0808.1119  [pdf]

    cs.CC cs.DM

    Graham's Schedules and the Number Partition Problem

    Authors: Seenu S. Reddi

    Abstract: We show the equivalence of the Number Partition Problem and the two processor scheduling problem. We establish a priori bounds on the completion times for the scheduling problem which are tighter than Graham's but almost on par with a posteriori bounds of Coffman and Sethi. We conclude the paper with a characterization of the asymptotic behavior of the scheduling problem which relates to the spr…

    Submitted 7 August, 2008; originally announced August 2008.

    Comments: 6 pages, 1 table