Showing 1–50 of 86 results for author: Kakade, S M

  1. arXiv:2407.01100  [pdf, other]

    cs.CL cs.LG

    Eliminating Position Bias of Language Models: A Mechanistic Approach

    Authors: Ziqi Wang, Hanlin Zhang, Xiner Li, Kuan-Hao Huang, Chi Han, Shuiwang Ji, Sham M. Kakade, Hao Peng, Heng Ji

    Abstract: Position bias has proven to be a prevalent issue of modern language models (LMs), where the models prioritize content based on its position within the given context. This bias often leads to unexpected model failures and hurts performance, robustness, and reliability across various applications. Our mechanistic analysis attributes the position bias to two components employed in nearly all state-of…

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: 18 pages, 5 figures

  2. arXiv:2406.11741  [pdf, other]

    cs.LG cs.AI

    Transcendence: Generative Models Can Outperform The Experts That Train Them

    Authors: Edwin Zhang, Vincent Zhu, Naomi Saphra, Anat Kleiman, Benjamin L. Edelman, Milind Tambe, Sham M. Kakade, Eran Malach

    Abstract: Generative models are trained with the simple objective of imitating the conditional probability distribution induced by the data they are trained on. Therefore, when trained on data generated by humans, we may not expect the artificial model to outperform the humans on their original objectives. In this work, we study the phenomenon of transcendence: when a generative model achieves capabilities…

    Submitted 28 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

    Comments: Code, models, and data at https://transcendence.eddie.win

  3. arXiv:2406.08466  [pdf, other]

    cs.LG cs.AI math.ST stat.ML

    Scaling Laws in Linear Regression: Compute, Parameters, and Data

    Authors: Licong Lin, Jingfeng Wu, Sham M. Kakade, Peter L. Bartlett, Jason D. Lee

    Abstract: Empirically, large-scale deep learning models often satisfy a neural scaling law: the test error of the trained model improves polynomially as the model size and data size grow. However, conventional wisdom suggests the test error consists of approximation, bias, and variance errors, where the variance error increases with model size. This disagrees with the general form of neural scaling laws, wh…

    Submitted 12 June, 2024; originally announced June 2024.

  4. arXiv:2405.18400  [pdf, other]

    cs.CL cs.LG

    Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass

    Authors: Ethan Shen, Alan Fan, Sarah M. Pratt, Jae Sung Park, Matthew Wallingford, Sham M. Kakade, Ari Holtzman, Ranjay Krishna, Ali Farhadi, Aditya Kusupati

    Abstract: Many applications today provide users with multiple auto-complete drafts as they type, including GitHub's code completion, Gmail's smart compose, and Apple's messaging auto-suggestions. Under the hood, language models support this by running an autoregressive inference pass to provide a draft. Consequently, providing $k$ drafts to the user requires running an expensive language model $k$ times. To…

    Submitted 24 June, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

    Comments: 22 pages, 15 figures

  5. arXiv:2404.12376  [pdf, other]

    cs.LG math.OC stat.ML

    Matching the Statistical Query Lower Bound for k-sparse Parity Problems with Stochastic Gradient Descent

    Authors: Yiwen Kou, Zixiang Chen, Quanquan Gu, Sham M. Kakade

    Abstract: The $k$-parity problem is a classical problem in computational complexity and algorithmic theory, serving as a key benchmark for understanding computational classes. In this paper, we solve the $k$-parity problem with stochastic gradient descent (SGD) on two-layer fully-connected neural networks. We demonstrate that SGD can efficiently solve the $k$-sparse parity problem on a $d$-dimensional hyper…

    Submitted 18 April, 2024; originally announced April 2024.

    Comments: 36 pages, 7 figures, 3 tables

  6. arXiv:2402.01032  [pdf, other]

    cs.LG cs.AI cs.CL

    Repeat After Me: Transformers are Better than State Space Models at Copying

    Authors: Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach

    Abstract: Transformers are the dominant architecture for sequence modeling, but there is growing interest in models that use a fixed-size latent state that does not depend on the sequence length, which we refer to as "generalized state space models" (GSSMs). In this paper we show that while GSSMs are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks th…

    Submitted 3 June, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

  7. arXiv:2303.12287  [pdf, ps, other]

    cs.LG cs.AI cs.GT stat.ML

    Hardness of Independent Learning and Sparse Equilibrium Computation in Markov Games

    Authors: Dylan J. Foster, Noah Golowich, Sham M. Kakade

    Abstract: We consider the problem of decentralized multi-agent reinforcement learning in Markov games. A fundamental question is whether there exist algorithms that, when adopted by all agents and run independently in a decentralized fashion, lead to no-regret for each player, analogous to celebrated convergence results in normal-form games. While recent work has shown that such algorithms exist for restric…

    Submitted 21 March, 2023; originally announced March 2023.

    Comments: 51 pages

  8. arXiv:2303.02255  [pdf, other]

    cs.LG math.OC stat.ML

    Finite-Sample Analysis of Learning High-Dimensional Single ReLU Neuron

    Authors: Jingfeng Wu, Difan Zou, Zixiang Chen, Vladimir Braverman, Quanquan Gu, Sham M. Kakade

    Abstract: This paper considers the problem of learning a single ReLU neuron with squared loss (a.k.a., ReLU regression) in the overparameterized regime, where the input dimension can exceed the number of samples. We analyze a Perceptron-type algorithm called GLM-tron (Kakade et al., 2011) and provide its dimension-free risk upper bounds for high-dimensional ReLU regression in both well-specified and misspec…

    Submitted 26 June, 2023; v1 submitted 3 March, 2023; originally announced March 2023.

    Comments: ICML 2023 camera ready

  9. arXiv:2302.14753  [pdf, other]

    cs.LG cs.AI stat.ML

    Learning Hidden Markov Models Using Conditional Samples

    Authors: Sham M. Kakade, Akshay Krishnamurthy, Gaurav Mahajan, Cyril Zhang

    Abstract: This paper is concerned with the computational complexity of learning the Hidden Markov Model (HMM). Although HMMs are some of the most widely used tools in sequential and time series modeling, they are cryptographically hard to learn in the standard setting where one has access to i.i.d. samples of observation sequences. In this paper, we depart from this setup and consider an interactive access…

    Submitted 24 February, 2024; v1 submitted 28 February, 2023; originally announced February 2023.

  10. arXiv:2210.09579  [pdf, other]

    cs.LG cs.AI

    Unpacking Reward Shaping: Understanding the Benefits of Reward Engineering on Sample Complexity

    Authors: Abhishek Gupta, Aldo Pacchiano, Yuexiang Zhai, Sham M. Kakade, Sergey Levine

    Abstract: Reinforcement learning provides an automated framework for learning behaviors from high-level reward specifications, but in practice the choice of reward function can be crucial for good results -- while in principle the reward only needs to specify what the task is, in reality practitioners often need to design more detailed rewards that provide the agent with some hints about how the task should…

    Submitted 18 October, 2022; originally announced October 2022.

  11. arXiv:2210.04157  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    The Role of Coverage in Online Reinforcement Learning

    Authors: Tengyang Xie, Dylan J. Foster, Yu Bai, Nan Jiang, Sham M. Kakade

    Abstract: Coverage conditions -- which assert that the data logging distribution adequately covers the state space -- play a fundamental role in determining the sample complexity of offline reinforcement learning. While such conditions might seem irrelevant to online reinforcement learning at first glance, we establish a new connection by showing -- somewhat surprisingly -- that the mere existence of a data…

    Submitted 8 October, 2022; originally announced October 2022.

  12. arXiv:2210.03137  [pdf, other]

    cs.LG math.OC

    Deep Inventory Management

    Authors: Dhruv Madeka, Kari Torkkola, Carson Eisenach, Anna Luo, Dean P. Foster, Sham M. Kakade

    Abstract: This work provides a Deep Reinforcement Learning approach to solving a periodic review inventory control system with stochastic vendor lead times, lost sales, correlated demand, and price matching. While this dynamic program has historically been considered intractable, our results show that several policy learning approaches are competitive with or outperform classical methods. In order to train…

    Submitted 28 November, 2022; v1 submitted 6 October, 2022; originally announced October 2022.

  13. arXiv:2208.01857  [pdf, other]

    cs.LG math.OC stat.ML

    The Power and Limitation of Pretraining-Finetuning for Linear Regression under Covariate Shift

    Authors: Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu, Sham M. Kakade

    Abstract: We study linear regression under covariate shift, where the marginal distribution over the input covariates differs in the source and the target domains, while the conditional distribution of the output given the input covariates is similar across the two domains. We investigate a transfer learning approach with pretraining on the source data and finetuning based on the target data (both conducted…

    Submitted 3 August, 2022; originally announced August 2022.

    Comments: 32 pages, 1 figure, 1 table

  14. arXiv:2203.03159  [pdf, other]

    cs.LG math.OC stat.ML

    Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime

    Authors: Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, Sham M. Kakade

    Abstract: Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization. Most of existing generalization analyses are made for single-pass SGD, which is a less practical variant compared to the commonly-used multi-pass SGD. Besides, theoretical analyses for multi-pass SGD often concern a worst-case instance in a class of problems, which…

    Submitted 7 March, 2022; originally announced March 2022.

    Comments: 28 pages, 2 figures

  15. arXiv:2112.13487  [pdf, other]

    cs.LG math.OC math.ST stat.ML

    The Statistical Complexity of Interactive Decision Making

    Authors: Dylan J. Foster, Sham M. Kakade, Jian Qian, Alexander Rakhlin

    Abstract: A fundamental challenge in interactive learning and decision making, ranging from bandit problems to reinforcement learning, is to provide sample-efficient, adaptive learning algorithms that achieve near-optimal regret. This question is analogous to the classical problem of optimal (supervised) statistical learning, where there are well-known complexity measures (e.g., VC dimension and Rademacher…

    Submitted 11 July, 2023; v1 submitted 26 December, 2021; originally announced December 2021.

    Comments: Minor improvements to writing and organization

  16. arXiv:2110.06198  [pdf, other]

    cs.LG math.OC stat.ML

    Last Iterate Risk Bounds of SGD with Decaying Stepsize for Overparameterized Linear Regression

    Authors: Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu, Sham M. Kakade

    Abstract: Stochastic gradient descent (SGD) has been shown to generalize well in many deep learning applications. In practice, one often runs SGD with a geometrically decaying stepsize, i.e., a constant initial stepsize followed by multiple geometric stepsize decay, and uses the last iterate as the output. This kind of SGD is known to be nearly minimax optimal for classical finite-dimensional linear regress…

    Submitted 11 July, 2022; v1 submitted 12 October, 2021; originally announced October 2021.

    Comments: 35 pages, 2 figures, 1 table. In ICML 2022

  17. arXiv:2108.04552  [pdf, other]

    cs.LG math.OC stat.ML

    The Benefits of Implicit Regularization from SGD in Least Squares Problems

    Authors: Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, Dean P. Foster, Sham M. Kakade

    Abstract: Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice, which has been hypothesized to play an important role in the generalization of modern machine learning approaches. In this work, we seek to understand these issues in the simpler setting of linear regression (including both underparameterized and overparameterized regimes), where our goal is to make s…

    Submitted 10 July, 2022; v1 submitted 10 August, 2021; originally announced August 2021.

    Comments: 33 pages, 1 figure. In NeurIPS 2021

  18. arXiv:2107.06466  [pdf, other]

    cs.LG stat.ML

    Going Beyond Linear RL: Sample Efficient Neural Function Approximation

    Authors: Baihe Huang, Kaixuan Huang, Sham M. Kakade, Jason D. Lee, Qi Lei, Runzhe Wang, Jiaqi Yang

    Abstract: Deep Reinforcement Learning (RL) powered by neural net approximation of the Q function has had enormous empirical success. While the theory of RL has traditionally focused on linear function approximation (or eluder dimension) approaches, little is known about nonlinear RL with neural net approximations of the Q functions. This is the focus of this work, where we study function approximation with…

    Submitted 25 December, 2021; v1 submitted 13 July, 2021; originally announced July 2021.

  19. arXiv:2107.04518  [pdf, ps, other]

    cs.LG stat.ML

    Optimal Gradient-based Algorithms for Non-concave Bandit Optimization

    Authors: Baihe Huang, Kaixuan Huang, Sham M. Kakade, Jason D. Lee, Qi Lei, Runzhe Wang, Jiaqi Yang

    Abstract: Bandit problems with linear or concave reward have been extensively studied, but relatively few works have studied bandits with non-concave reward. This work considers a large family of bandit problems where the unknown underlying reward function is non-concave, including the low-rank generalized linear bandit problems and two-layer neural network with polynomial activation bandit problem. For the…

    Submitted 9 July, 2021; originally announced July 2021.

  20. arXiv:2107.02377  [pdf, ps, other]

    cs.LG cs.AI math.OC stat.ML

    A Short Note on the Relationship of Information Gain and Eluder Dimension

    Authors: Kaixuan Huang, Sham M. Kakade, Jason D. Lee, Qi Lei

    Abstract: Eluder dimension and information gain are two widely used methods of complexity measures in bandit and reinforcement learning. Eluder dimension was originally proposed as a general complexity measure of function classes, but the common examples of where it is known to be small are function spaces (vector spaces). In these cases, the primary tool to upper bound the eluder dimension is the elliptic…

    Submitted 6 July, 2021; originally announced July 2021.

  21. arXiv:2103.12692  [pdf, other]

    cs.LG math.OC stat.ML

    Benign Overfitting of Constant-Stepsize SGD for Linear Regression

    Authors: Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, Sham M. Kakade

    Abstract: There is an increasing realization that algorithmic inductive biases are central in preventing overfitting; empirically, we often see a benign overfitting phenomenon in overparameterized settings for natural learning algorithms, such as stochastic gradient descent (SGD), where little to no explicit regularization has been employed. This work considers this issue in arguably the most basic setting:…

    Submitted 12 October, 2021; v1 submitted 23 March, 2021; originally announced March 2021.

    Comments: 56 pages, 2 figures. A short version is accepted at the 34th Annual Conference on Learning Theory (COLT 2021)

  22. arXiv:2103.12690  [pdf, other]

    cs.LG cs.AI stat.ML

    An Exponential Lower Bound for Linearly-Realizable MDPs with Constant Suboptimality Gap

    Authors: Yuanhao Wang, Ruosong Wang, Sham M. Kakade

    Abstract: A fundamental question in the theory of reinforcement learning is: suppose the optimal $Q$-function lies in the linear span of a given $d$ dimensional feature mapping, is sample-efficient reinforcement learning (RL) possible? The recent and remarkable result of Weisz et al. (2020) resolved this question in the negative, providing an exponential (in $d$) sample size lower bound, which holds even if…

    Submitted 19 October, 2021; v1 submitted 23 March, 2021; originally announced March 2021.

  23. arXiv:2103.10897  [pdf, ps, other]

    cs.LG cs.AI math.OC stat.ML

    Bilinear Classes: A Structural Framework for Provable Generalization in RL

    Authors: Simon S. Du, Sham M. Kakade, Jason D. Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, Ruosong Wang

    Abstract: This work introduces Bilinear Classes, a new structural framework, which permit generalization in reinforcement learning in a wide variety of settings through the use of function approximation. The framework incorporates nearly all existing models in which a polynomial sample complexity is achievable, and, notably, also includes new models, such as the Linear $Q^*/V^*$ model in which both the opti…

    Submitted 11 July, 2021; v1 submitted 19 March, 2021; originally announced March 2021.

    Comments: Expanded extension section to include generalized linear bellman complete and changed related work

  24. arXiv:2103.04947  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    Instabilities of Offline RL with Pre-Trained Neural Representation

    Authors: Ruosong Wang, Yifan Wu, Ruslan Salakhutdinov, Sham M. Kakade

    Abstract: In offline reinforcement learning (RL), we seek to utilize offline data to evaluate (or learn) policies in scenarios where the data are collected from a distribution that substantially differs from that of the target policy to be evaluated. Recent theoretical advances have shown that such sample-efficient offline RL is indeed possible provided certain strong representational conditions hold, else…

    Submitted 8 March, 2021; originally announced March 2021.

  25. arXiv:2010.11895  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    What are the Statistical Limits of Offline RL with Linear Function Approximation?

    Authors: Ruosong Wang, Dean P. Foster, Sham M. Kakade

    Abstract: Offline reinforcement learning seeks to utilize offline (observational) data to guide the learning of (causal) sequential decision making strategies. The hope is that offline reinforcement learning coupled with function approximation methods (to deal with the curse of dimensionality) can provide a means to help alleviate the excessive sample complexity burden in modern sequential decision making p…

    Submitted 22 October, 2020; originally announced October 2020.

  26. arXiv:2007.07461  [pdf, ps, other]

    cs.LG cs.GT cs.MA math.OC stat.ML

    Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity

    Authors: Kaiqing Zhang, Sham M. Kakade, Tamer Başar, Lin F. Yang

    Abstract: Model-based reinforcement learning (RL), which finds an optimal policy using an empirical model, has long been recognized as one of the corner stones of RL. It is especially suitable for multi-agent RL (MARL), as it naturally decouples the learning and the planning phases, and avoids the non-stationarity problem when all agents are improving their policies simultaneously using samples. Though intu…

    Submitted 8 August, 2023; v1 submitted 14 July, 2020; originally announced July 2020.

    Comments: Updated version accepted to Journal of Machine Learning Research (JMLR)

  27. arXiv:2006.12484  [pdf, ps, other]

    cs.LG cs.AI math.OC stat.ML

    Sample-Efficient Reinforcement Learning of Undercomplete POMDPs

    Authors: Chi Jin, Sham M. Kakade, Akshay Krishnamurthy, Qinghua Liu

    Abstract: Partial observability is a common challenge in many reinforcement learning applications, which requires an agent to maintain memory, infer latent states, and integrate this past information into exploration. This challenge leads to a number of computational and statistical hardness results for learning general Partially Observable Markov Decision Processes (POMDPs). This work shows that these hard…

    Submitted 24 October, 2020; v1 submitted 22 June, 2020; originally announced June 2020.

    Comments: To appear at NeurIPS 2020 as spotlight

  28. arXiv:2005.00527  [pdf, ps, other]

    cs.LG cs.AI math.OC stat.ML

    Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning?

    Authors: Ruosong Wang, Simon S. Du, Lin F. Yang, Sham M. Kakade

    Abstract: Learning to plan for long horizons is a central challenge in episodic reinforcement learning problems. A fundamental question is to understand how the difficulty of the problem scales as the horizon increases. Here the natural measure of sample complexity is a normalized one: we are interested in the number of episodes it takes to provably discover a policy whose value is $\varepsilon$ near to tha…

    Submitted 9 July, 2020; v1 submitted 1 May, 2020; originally announced May 2020.

  29. arXiv:2002.09434  [pdf, ps, other]

    cs.LG math.OC stat.ML

    Few-Shot Learning via Learning the Representation, Provably

    Authors: Simon S. Du, Wei Hu, Sham M. Kakade, Jason D. Lee, Qi Lei

    Abstract: This paper studies few-shot learning via representation learning, where one uses $T$ source tasks with $n_1$ data per task to learn a representation in order to reduce the sample complexity of a target task for which there is only $n_2 (\ll n_1)$ data. Specifically, we focus on the setting where there exists a good \emph{common representation} between source and target, and our goal is to understa…

    Submitted 30 March, 2021; v1 submitted 21 February, 2020; originally announced February 2020.

    Comments: ICLR2021

  30. arXiv:1912.13445  [pdf, other]

    stat.ML cs.CR cs.LG

    Robust Aggregation for Federated Learning

    Authors: Krishna Pillutla, Sham M. Kakade, Zaid Harchaoui

    Abstract: Federated learning is the centralized training of statistical models from decentralized data on mobile devices while preserving the privacy of each device. We present a robust aggregation approach to make federated learning robust to settings when a fraction of the devices may be sending corrupted updates to the server. The approach relies on a robust aggregation oracle based on the geometric medi…

    Submitted 17 January, 2022; v1 submitted 31 December, 2019; originally announced December 2019.

    Journal ref: IEEE Transactions on Signal Processing 70 (2022): 1142-1154

  31. arXiv:1911.12568  [pdf, other]

    cs.LG math.ST stat.ML

    Optimal Estimation of Change in a Population of Parameters

    Authors: Ramya Korlakai Vinayak, Weihao Kong, Sham M. Kakade

    Abstract: Paired estimation of change in parameters of interest over a population plays a central role in several application domains including those in the social sciences, epidemiology, medicine and biology. In these domains, the size of the population under study is often very large, however, the number of observations available per individual in the population is very small (\emph{sparse observations})…

    Submitted 28 November, 2019; originally announced November 2019.

  32. arXiv:1911.12178  [pdf, ps, other]

    cs.LG stat.ML

    The Nonstochastic Control Problem

    Authors: Elad Hazan, Sham M. Kakade, Karan Singh

    Abstract: We consider the problem of controlling an unknown linear dynamical system in the presence of (nonstochastic) adversarial perturbations and adversarial convex loss functions. In contrast to classical control, the a priori determination of an optimal controller here is hindered by the latter's dependence on the yet unknown perturbations and costs. Instead, we measure regret against an optimal linear…

    Submitted 20 January, 2020; v1 submitted 27 November, 2019; originally announced November 2019.

    Comments: To appear at Algorithmic Learning Theory (ALT) 2020; small revisions from the last ver

  33. arXiv:1910.03016  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning?

    Authors: Simon S. Du, Sham M. Kakade, Ruosong Wang, Lin F. Yang

    Abstract: Modern deep learning methods provide effective means to learn good representations. However, is a good representation itself sufficient for sample efficient reinforcement learning? This question has largely been studied only with respect to (worst-case) approximation error, in the more classical approximate dynamic programming literature. With regards to the statistical viewpoint, this question is…

    Submitted 27 February, 2020; v1 submitted 7 October, 2019; originally announced October 2019.

    Comments: To appear in ICLR 2020

  34. arXiv:1908.00261  [pdf, ps, other]

    cs.LG stat.ML

    On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift

    Authors: Alekh Agarwal, Sham M. Kakade, Jason D. Lee, Gaurav Mahajan

    Abstract: Policy gradient methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including: if and how fast they converge to a globally optimal solution or how they cope with approximation error due to using a restricted class of parametric poli…

    Submitted 14 October, 2020; v1 submitted 1 August, 2019; originally announced August 2019.

    Comments: Corollary 6.1 added for a cleaner comparison to prior work. $ε_{\mathrm{bias}}$ is now used instead of $ε_{\mathrm{approx}}$ to denote the transfer approximation error

  35. arXiv:1906.05664  [pdf, other]

    cs.CL cs.LG stat.ML

    Calibration, Entropy Rates, and Memory in Language Models

    Authors: Mark Braverman, Xinyi Chen, Sham M. Kakade, Karthik Narasimhan, Cyril Zhang, Yi Zhang

    Abstract: Building accurate language models that capture meaningful long-term dependencies is a core challenge in natural language processing. Towards this end, we present a calibration-based approach to measure long-term discrepancies between a generative sequence model and the true distribution, and use these discrepancies to improve the model. Empirically, we show that state-of-the-art language models, i…

    Submitted 11 June, 2019; originally announced June 2019.

  36. arXiv:1904.12838  [pdf, other]

    cs.LG math.OC stat.ML

    The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure For Least Squares

    Authors: Rong Ge, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli

    Abstract: Minimax optimal convergence rates for classes of stochastic convex optimization problems are well characterized, where the majority of results utilize iterate averaged stochastic gradient descent (SGD) with polynomially decaying step sizes. In contrast, SGD's final iterate behavior has received much less attention despite their widespread use in practice. Motivated by this observation, this work p…

    Submitted 29 October, 2019; v1 submitted 29 April, 2019; originally announced April 2019.

    Comments: Appears in the proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2019. 28 pages, 4 tables, 1 Algorithm, 7 figures

  37. arXiv:1902.08721  [pdf, ps, other]

    cs.LG eess.SY math.OC stat.ML

    Online Control with Adversarial Disturbances

    Authors: Naman Agarwal, Brian Bullins, Elad Hazan, Sham M. Kakade, Karan Singh

    Abstract: We study the control of a linear dynamical system with adversarial disturbances (as opposed to statistical noise). The objective we consider is one of regret: we desire an online control procedure that can do nearly as well as that of a procedure that has full knowledge of the disturbances in hindsight. Our main result is an efficient algorithm that provides nearly tight regret bounds for this pro…

    Submitted 22 February, 2019; originally announced February 2019.

  38. arXiv:1902.04811  [pdf, ps, other]

    cs.LG math.OC stat.ML

    On Nonconvex Optimization for Machine Learning: Gradients, Stochasticity, and Saddle Points

    Authors: Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M. Kakade, Michael I. Jordan

    Abstract: Gradient descent (GD) and stochastic gradient descent (SGD) are the workhorses of large-scale machine learning. While classical theory focused on analyzing the performance of these methods in convex optimization problems, the most notable successes in machine learning have involved nonconvex optimization, and a gap has arisen between theory and practice. Indeed, traditional analyses of GD and SGD…

    Submitted 3 September, 2019; v1 submitted 13 February, 2019; originally announced February 2019.

    Comments: A preliminary version of this paper, with a subset of the results that are presented here, was presented at ICML 2017 (also as arXiv:1703.00887)

  39. arXiv:1902.04553  [pdf, ps, other]

    math.ST cs.LG stat.ML

    Maximum Likelihood Estimation for Learning Populations of Parameters

    Authors: Ramya Korlakai Vinayak, Weihao Kong, Gregory Valiant, Sham M. Kakade

    Abstract: Consider a setting with $N$ independent individuals, each with an unknown parameter, $p_i \in [0, 1]$ drawn from some unknown distribution $P^\star$. After observing the outcomes of $t$ independent Bernoulli trials, i.e., $X_i \sim \text{Binomial}(t, p_i)$ per individual, our objective is to accurately estimate $P^\star$. This problem arises in numerous domains, including the social sciences, psyc…

    Submitted 12 February, 2019; originally announced February 2019.

  40. arXiv:1902.03736  [pdf, ps, other]

    math.PR cs.LG stat.ML

    A Short Note on Concentration Inequalities for Random Vectors with SubGaussian Norm

    Authors: Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M. Kakade, Michael I. Jordan

    Abstract: In this note, we derive concentration inequalities for random vectors with subGaussian norm (a generalization of both subGaussian random vectors and norm bounded random vectors), which are tight up to logarithmic factors.

    Submitted 11 February, 2019; originally announced February 2019.

  41. arXiv:1902.03228  [pdf, other]

    stat.ML cs.LG math.OC

    A Smoother Way to Train Structured Prediction Models

    Authors: Krishna Pillutla, Vincent Roulet, Sham M. Kakade, Zaid Harchaoui

    Abstract: We present a framework to train a structured prediction model by performing smoothing on the inference algorithm it builds upon. Smoothing overcomes the non-smoothness inherent to the maximum margin structured prediction objective, and paves the way for the use of fast primal gradient-based optimization algorithms. We illustrate the proposed framework by developing a novel primal incremental optim…

    Submitted 8 February, 2019; originally announced February 2019.

    Comments: Short version appeared in Neural Information Processing Systems (NeurIPS) 2018

  42. arXiv:1812.02690  [pdf, other]

    cs.LG cs.AI stat.ML

    Provably Efficient Maximum Entropy Exploration

    Authors: Elad Hazan, Sham M. Kakade, Karan Singh, Abby Van Soest

    Abstract: Suppose an agent is in a (possibly unknown) Markov Decision Process in the absence of a reward signal, what might we hope that an agent can efficiently learn to do? This work studies a broad class of objectives that are defined solely as functions of the state-visitation frequencies that are induced by how the agent behaves. For example, one natural, intrinsically defined, objective problem is for…

    Submitted 25 January, 2019; v1 submitted 6 December, 2018; originally announced December 2018.

    Comments: Updated experiment results; minor revisions in writing

  43. arXiv:1811.08045  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    Coupled Recurrent Models for Polyphonic Music Composition

    Authors: John Thickstun, Zaid Harchaoui, Dean P. Foster, Sham M. Kakade

    Abstract: This paper introduces a novel recurrent model for music composition that is tailored to the structure of polyphonic music. We propose an efficient new conditional probabilistic factorization of musical scores, viewing a score as a collection of concurrent, coupled sequences: i.e. voices. To model the conditional distributions, we borrow ideas from both convolutional and recurrent neural models; we…

    Submitted 26 November, 2019; v1 submitted 19 November, 2018; originally announced November 2018.

    Comments: 13 pages; long version of the paper appearing in ISMIR 2019

  44. arXiv:1803.05591  [pdf, other]

    cs.LG math.OC stat.ML

    On the insufficiency of existing momentum schemes for Stochastic Optimization

    Authors: Rahul Kidambi, Praneeth Netrapalli, Prateek Jain, Sham M. Kakade

    Abstract: Momentum based stochastic gradient methods such as heavy ball (HB) and Nesterov's accelerated gradient descent (NAG) method are widely used in practice for training deep networks and other supervised learning models, as they often provide significant improvements over stochastic gradient descent (SGD). Rigorously speaking, "fast gradient" methods have provable improvements over gradient descent on…

    Submitted 31 July, 2018; v1 submitted 15 March, 2018; originally announced March 2018.

    Comments: 28 pages, 10 figures. Updated acknowledgements. Appeared as an oral presentation at International Conference on Learning Representations (ICLR), 2018. Code implementing the ASGD method can be found at https://github.com/rahulkidambi/AccSGD

  45. arXiv:1801.05039  [pdf, other]

    cs.LG stat.ML

    Global Convergence of Policy Gradient Methods for the Linear Quadratic Regulator

    Authors: Maryam Fazel, Rong Ge, Sham M. Kakade, Mehran Mesbahi

    Abstract: Direct policy gradient methods for reinforcement learning and continuous control problems are a popular approach for a variety of reasons: 1) they are easy to implement without explicit knowledge of the underlying model 2) they are an "end-to-end" approach, directly optimizing the performance metric of interest 3) they inherently allow for richly parameterized policies. A notable drawback is that…

    Submitted 23 March, 2019; v1 submitted 15 January, 2018; originally announced January 2018.

  46. arXiv:1711.04845  [pdf, other]

    stat.ML cs.LG cs.SD eess.AS

    Invariances and Data Augmentation for Supervised Music Transcription

    Authors: John Thickstun, Zaid Harchaoui, Dean Foster, Sham M. Kakade

    Abstract: This paper explores a variety of models for frame-based music transcription, with an emphasis on the methods needed to reach state-of-the-art on human recordings. The translation-invariant network discussed in this paper, which combines a traditional filterbank with a convolutional neural network, was the top-performing model in the 2017 MIREX Multiple Fundamental Frequency Estimation evaluation.…

    Submitted 13 November, 2017; originally announced November 2017.

    Comments: 6 pages

  47. arXiv:1710.09430  [pdf, ps, other]

    stat.ML cs.LG math.OC

    A Markov Chain Theory Approach to Characterizing the Minimax Optimality of Stochastic Gradient Descent (for Least Squares)

    Authors: Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, Venkata Krishna Pillutla, Aaron Sidford

    Abstract: This work provides a simplified proof of the statistical minimax optimality of (iterate averaged) stochastic gradient descent (SGD), for the special case of least squares. This result is obtained by analyzing SGD as a stochastic process and by sharply characterizing the stationary covariance matrix of this process. The finite rate optimality characterization captures the constant factors and addre…

    Submitted 21 July, 2018; v1 submitted 25 October, 2017; originally announced October 2017.

    Comments: Lemma 1 has been updated in v2

  48. arXiv:1704.08227  [pdf, other]

    stat.ML cs.LG math.OC math.ST

    Accelerating Stochastic Gradient Descent For Least Squares Regression

    Authors: Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, Aaron Sidford

    Abstract: There is widespread sentiment that it is not possible to effectively utilize fast gradient methods (e.g. Nesterov's acceleration, conjugate gradient, heavy ball) for the purposes of stochastic optimization due to their instability and error accumulation, a notion made precise in d'Aspremont 2008 and Devolder, Glineur, and Nesterov 2014. This work considers these issues for the special case of stoc…

    Submitted 31 July, 2018; v1 submitted 26 April, 2017; originally announced April 2017.

    Comments: 54 pages, 3 figures, 1 table; updated acknowledgements, minor title change. Paper appeared in the proceedings of the Conference on Learning Theory (COLT), 2018

  49. arXiv:1703.00887  [pdf, ps, other]

    cs.LG math.OC stat.ML

    How to Escape Saddle Points Efficiently

    Authors: Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, Michael I. Jordan

    Abstract: This paper shows that a perturbed form of gradient descent converges to a second-order stationary point in a number of iterations which depends only poly-logarithmically on dimension (i.e., it is almost "dimension-free"). The convergence rate of this procedure matches the well-known convergence rate of gradient descent to first-order stationary points, up to log factors. When all saddle points are no…

    Submitted 2 March, 2017; originally announced March 2017.

  50. arXiv:1612.00516  [pdf, other]

    stat.ML cs.LG

    Canonical Correlation Analysis for Analyzing Sequences of Medical Billing Codes

    Authors: Corinne L. Jones, Sham M. Kakade, Lucas W. Thornblade, David R. Flum, Abraham D. Flaxman

    Abstract: We propose using canonical correlation analysis (CCA) to generate features from sequences of medical billing codes. Applying this novel use of CCA to a database of medical billing codes for patients with diverticulitis, we first demonstrate that the CCA embeddings capture meaningful relationships among the codes. We then generate features from these embeddings and establish their usefulness in pre…

    Submitted 6 January, 2017; v1 submitted 1 December, 2016; originally announced December 2016.

    Comments: Accepted at NIPS 2016 Workshop on Machine Learning for Health