Showing 1–50 of 130 results for author: Lee, J D

Searching in archive cs.
  1. arXiv:2406.19617  [pdf, ps, other]

    cs.LG cs.IT math.OC

    Stochastic Zeroth-Order Optimization under Strongly Convexity and Lipschitz Hessian: Minimax Sample Complexity

    Authors: Qian Yu, Yining Wang, Baihe Huang, Qi Lei, Jason D. Lee

    Abstract: Optimization of convex functions under stochastic zeroth-order feedback has been a major and challenging question in online learning. In this work, we consider the problem of optimizing second-order smooth and strongly convex functions where the algorithm is only accessible to noisy evaluations of the objective function it queries. We provide the first tight characterization for the rate of the mi…

    Submitted 27 June, 2024; originally announced June 2024.

  2. arXiv:2406.08466  [pdf, other]

    cs.LG cs.AI math.ST stat.ML

    Scaling Laws in Linear Regression: Compute, Parameters, and Data

    Authors: Licong Lin, Jingfeng Wu, Sham M. Kakade, Peter L. Bartlett, Jason D. Lee

    Abstract: Empirically, large-scale deep learning models often satisfy a neural scaling law: the test error of the trained model improves polynomially as the model size and data size grow. However, conventional wisdom suggests the test error consists of approximation, bias, and variance errors, where the variance error increases with model size. This disagrees with the general form of neural scaling laws, wh…

    Submitted 12 June, 2024; originally announced June 2024.

  3. arXiv:2406.06893  [pdf, other]

    stat.ML cs.IT cs.LG

    Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot

    Authors: Zixuan Wang, Stanley Wei, Daniel Hsu, Jason D. Lee

    Abstract: The transformer architecture has prevailed in various deep learning settings due to its exceptional capabilities to select and compose structural information. Motivated by these capabilities, Sanford et al. proposed the sparse token selection task, in which transformers excel while fully-connected networks (FCNs) fail in the worst case. Building upon that, we strengthen the FCN lower bound to an a…

    Submitted 10 June, 2024; originally announced June 2024.

  4. arXiv:2406.01581  [pdf, other]

    cs.LG stat.ML

    Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit

    Authors: Jason D. Lee, Kazusato Oko, Taiji Suzuki, Denny Wu

    Abstract: We study the problem of gradient descent learning of a single-index target function $f_*(\boldsymbol{x}) = \sigma_*\left(\langle\boldsymbol{x},\boldsymbol{\theta}\rangle\right)$ under isotropic Gaussian data in $\mathbb{R}^d$, where the link function $\sigma_*:\mathbb{R}\to\mathbb{R}$ is an unknown degree $q$ polynomial with information exponent $p$ (defined as the lowest degree in the Hermite expansion)…

    Submitted 3 June, 2024; originally announced June 2024.

    Comments: 34 pages

  5. arXiv:2404.16767  [pdf, other]

    cs.LG cs.CL cs.CV

    REBEL: Reinforcement Learning via Regressing Relative Rewards

    Authors: Zhaolin Gao, Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Gokul Swamy, Kianté Brantley, Thorsten Joachims, J. Andrew Bagnell, Jason D. Lee, Wen Sun

    Abstract: While originally developed for continuous control problems, Proximal Policy Optimization (PPO) has emerged as the work-horse of a variety of reinforcement learning (RL) applications, including the fine-tuning of generative models. Unfortunately, PPO requires multiple heuristics to enable stable convergence (e.g. value networks, clipping), and is notorious for its sensitivity to the precise impleme…

    Submitted 29 May, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

    Comments: New experimental results on general chat

  6. arXiv:2404.08495  [pdf, other]

    cs.LG cs.AI cs.CL

    Dataset Reset Policy Optimization for RLHF

    Authors: Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Kianté Brantley, Dipendra Misra, Jason D. Lee, Wen Sun

    Abstract: Reinforcement Learning (RL) from Human Preference-based feedback is a popular paradigm for fine-tuning generative models, which has produced impressive models such as GPT-4 and Claude3 Opus. This framework often consists of two steps: learning a reward model from an offline preference dataset followed by running online RL to optimize the learned reward model. In this work, leveraging the idea of r…

    Submitted 16 April, 2024; v1 submitted 12 April, 2024; originally announced April 2024.

    Comments: 28 pages, 6 tables, 3 Figures, 3 Algorithms

  7. arXiv:2404.05832  [pdf, other]

    cs.HC eess.SY

    Human-Machine Interaction in Automated Vehicles: Reducing Voluntary Driver Intervention

    Authors: Xinzhi Zhong, Yang Zhou, Varshini Kamaraj, Zhenhao Zhou, Wissam Kontar, Dan Negrut, John D. Lee, Soyoung Ahn

    Abstract: This paper develops a novel car-following control method to reduce voluntary driver interventions and improve traffic stability in Automated Vehicles (AVs). Through a combination of experimental and empirical analysis, we show how voluntary driver interventions can instigate substantial traffic disturbances that are amplified along the traffic upstream. Motivated by these findings, we present a fr…

    Submitted 8 April, 2024; originally announced April 2024.

  8. arXiv:2403.10738  [pdf, ps, other]

    cs.LG

    Horizon-Free Regret for Linear Markov Decision Processes

    Authors: Zihan Zhang, Jason D. Lee, Yuxin Chen, Simon S. Du

    Abstract: A recent line of works showed regret bounds in reinforcement learning (RL) can be (nearly) independent of planning horizon, a.k.a. the horizon-free bounds. However, these regret bounds only apply to settings where a polynomial dependency on the size of transition model is allowed, such as tabular Markov Decision Process (MDP) and linear mixture MDP. We give the first horizon-free bound for the pop…

    Submitted 15 March, 2024; originally announced March 2024.

    Comments: Published as a conference paper in ICLR 2024

  9. arXiv:2403.05529  [pdf, other]

    cs.LG stat.ML

    Computational-Statistical Gaps in Gaussian Single-Index Models

    Authors: Alex Damian, Loucas Pillaud-Vivien, Jason D. Lee, Joan Bruna

    Abstract: Single-Index Models are high-dimensional regression problems with planted structure, whereby labels depend on an unknown one-dimensional projection of the input via a generic, non-linear, and potentially non-deterministic transformation. As such, they encompass a broad class of statistical inference tasks, and provide a rich template to study statistical and computational trade-offs in the high-di…

    Submitted 12 March, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

    Comments: 61 pages

  10. arXiv:2403.03183  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    How Well Can Transformers Emulate In-context Newton's Method?

    Authors: Angeliki Giannou, Liu Yang, Tianhao Wang, Dimitris Papailiopoulos, Jason D. Lee

    Abstract: Transformer-based models have demonstrated remarkable in-context learning capabilities, prompting extensive research into its underlying mechanisms. Recent studies have suggested that Transformers can implement first-order optimization algorithms for in-context learning and even second order ones for the case of linear regression. In this work, we study whether Transformers can perform higher orde…

    Submitted 5 March, 2024; originally announced March 2024.

  11. arXiv:2402.14735  [pdf, other]

    cs.LG cs.IT stat.ML

    How Transformers Learn Causal Structure with Gradient Descent

    Authors: Eshaan Nichani, Alex Damian, Jason D. Lee

    Abstract: The incredible success of transformers on sequence modeling tasks can be largely attributed to the self-attention mechanism, which allows information to be transferred between different parts of a sequence. Self-attention allows transformers to encode causal structure which makes them particularly suitable for sequence modeling. However, the process by which transformers learn such causal structur…

    Submitted 22 February, 2024; originally announced February 2024.

  12. arXiv:2402.11867  [pdf, other]

    cs.LG math.OC

    LoRA Training in the NTK Regime has No Spurious Local Minima

    Authors: Uijeong Jang, Jason D. Lee, Ernest K. Ryu

    Abstract: Low-rank adaptation (LoRA) has become the standard approach for parameter-efficient fine-tuning of large language models (LLM), but our theoretical understanding of LoRA has been limited. In this work, we theoretically analyze LoRA fine-tuning in the neural tangent kernel (NTK) regime with $N$ data points, showing: (i) full fine-tuning (without LoRA) admits a low-rank solution of rank…

    Submitted 28 May, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

    Comments: 23 pages

  13. arXiv:2402.11592  [pdf, other]

    cs.LG cs.CL

    Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark

    Authors: Yihua Zhang, Pingzhi Li, Junyuan Hong, Jiaxiang Li, Yimeng Zhang, Wenqing Zheng, Pin-Yu Chen, Jason D. Lee, Wotao Yin, Mingyi Hong, Zhangyang Wang, Sijia Liu, Tianlong Chen

    Abstract: In the evolving landscape of natural language processing (NLP), fine-tuning pre-trained Large Language Models (LLMs) with first-order (FO) optimizers like SGD and Adam has become standard. Yet, as LLMs grow in size, the substantial memory overhead from back-propagation (BP) for FO gradient computation presents a significant challenge. Addressing this issue is crucial, especially for applications…

    Submitted 27 May, 2024; v1 submitted 18 February, 2024; originally announced February 2024.

  14. arXiv:2402.10193  [pdf, other]

    cs.LG cs.CL

    BitDelta: Your Fine-Tune May Only Be Worth One Bit

    Authors: James Liu, Guangxuan Xiao, Kai Li, Jason D. Lee, Song Han, Tri Dao, Tianle Cai

    Abstract: Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks. Given the higher computational demand of pre-training, it's intuitive to assume that fine-tuning adds less new information to the model, and is thus more compressible. We explore this assumption by decomposing the weights of fine-tuned models into t…

    Submitted 27 February, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

  15. arXiv:2401.15530  [pdf, ps, other]

    cs.LG cs.IT

    An Information-Theoretic Analysis of In-Context Learning

    Authors: Hong Jun Jeon, Jason D. Lee, Qi Lei, Benjamin Van Roy

    Abstract: Previous theoretical results pertaining to meta-learning on sequences build on contrived assumptions and are somewhat convoluted. We introduce new information-theoretic tools that lead to an elegant and very general decomposition of error into three components: irreducible error, meta-learning error, and intra-task error. These tools unify analyses across many meta-learning challenges. To illustra…

    Submitted 27 January, 2024; originally announced January 2024.

  16. arXiv:2401.10774  [pdf, other]

    cs.LG cs.CL

    Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    Authors: Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao

    Abstract: Large Language Models (LLMs) employ auto-regressive decoding that requires sequential computation, with each step reliant on the previous one's output. This creates a bottleneck as each step necessitates moving the full model parameters from High-Bandwidth Memory (HBM) to the accelerator's cache. While methods such as speculative decoding have been suggested to address this issue, their implementa…

    Submitted 14 June, 2024; v1 submitted 19 January, 2024; originally announced January 2024.

    Comments: The code for this implementation is available at https://github.com/FasterDecoding/Medusa

  17. arXiv:2312.07930  [pdf, other]

    cs.LG cs.CL cs.CR cs.IT stat.ML

    Towards Optimal Statistical Watermarking

    Authors: Baihe Huang, Hanlin Zhu, Banghua Zhu, Kannan Ramchandran, Michael I. Jordan, Jason D. Lee, Jiantao Jiao

    Abstract: We study statistical watermarking by formulating it as a hypothesis testing problem, a general framework which subsumes all previous statistical watermarking methods. Key to our formulation is a coupling of the output tokens and the rejection region, realized by pseudo-random generators in practice, that allows non-trivial trade-offs between the Type I error and Type II error. We characterize the…

    Submitted 6 February, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

  18. arXiv:2312.05134  [pdf, other]

    cs.LG stat.ML

    Optimal Multi-Distribution Learning

    Authors: Zihan Zhang, Wenhao Zhan, Yuxin Chen, Simon S. Du, Jason D. Lee

    Abstract: Multi-distribution learning (MDL), which seeks to learn a shared model that minimizes the worst-case risk across $k$ distinct data distributions, has emerged as a unified framework in response to the evolving demand for robustness, fairness, multi-group collaboration, etc. Achieving data-efficient MDL necessitates adaptive sampling, also called on-demand sampling, throughout the learning process.…

    Submitted 23 May, 2024; v1 submitted 8 December, 2023; originally announced December 2023.

  19. arXiv:2312.00854  [pdf, other]

    physics.med-ph cs.AI cs.LG math.NA stat.CO

    A Probabilistic Neural Twin for Treatment Planning in Peripheral Pulmonary Artery Stenosis

    Authors: John D. Lee, Jakob Richter, Martin R. Pfaller, Jason M. Szafron, Karthik Menon, Andrea Zanoni, Michael R. Ma, Jeffrey A. Feinstein, Jacqueline Kreutzer, Alison L. Marsden, Daniele E. Schiavazzi

    Abstract: The substantial computational cost of high-fidelity models in numerical hemodynamics has, so far, relegated their use mainly to offline treatment planning. New breakthroughs in data-driven architectures and optimization techniques for fast surrogate modeling provide an exciting opportunity to overcome these limitations, enabling the use of such technology for time-critical decisions. We discuss an…

    Submitted 1 December, 2023; originally announced December 2023.

  20. arXiv:2311.18817  [pdf, other]

    cs.LG cs.AI

    Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking

    Authors: Kaifeng Lyu, Jikai Jin, Zhiyuan Li, Simon S. Du, Jason D. Lee, Wei Hu

    Abstract: Recent work by Power et al. (2022) highlighted a surprising "grokking" phenomenon in learning arithmetic tasks: a neural net first "memorizes" the training set, resulting in perfect training accuracy but near-random test accuracy, and after training for sufficiently longer, it suddenly transitions to perfect test accuracy. This paper studies the grokking phenomenon in theoretical setups and shows…

    Submitted 2 April, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

    Comments: Published as a conference paper at ICLR 2024; 40 pages, 4 figures

  21. arXiv:2311.13774  [pdf, other]

    cs.LG stat.ML

    Learning Hierarchical Polynomials with Three-Layer Neural Networks

    Authors: Zihao Wang, Eshaan Nichani, Jason D. Lee

    Abstract: We study the problem of learning hierarchical polynomials over the standard Gaussian distribution with three-layer neural networks. We specifically consider target functions of the form $h = g \circ p$ where $p : \mathbb{R}^d \rightarrow \mathbb{R}$ is a degree $k$ polynomial and $g: \mathbb{R} \rightarrow \mathbb{R}$ is a degree $q$ polynomial. This function class generalizes the single-index mod…

    Submitted 22 November, 2023; originally announced November 2023.

    Comments: 57 pages

  22. arXiv:2311.11965  [pdf, other]

    cs.LG stat.ML

    Provably Efficient CVaR RL in Low-rank MDPs

    Authors: Yulai Zhao, Wenhao Zhan, Xiaoyan Hu, Ho-fung Leung, Farzan Farnia, Wen Sun, Jason D. Lee

    Abstract: We study risk-sensitive Reinforcement Learning (RL), where we aim to maximize the Conditional Value at Risk (CVaR) with a fixed risk tolerance $τ$. Prior theoretical work studying risk-sensitive RL focuses on the tabular Markov Decision Processes (MDPs) setting. To extend CVaR RL to settings where state space is large, function approximation must be deployed. We study CVaR RL in low-rank MDPs with…

    Submitted 20 November, 2023; originally announced November 2023.

    Comments: The first three authors contribute equally and are ordered randomly

  23. arXiv:2311.08252  [pdf, other]

    cs.CL cs.AI cs.IR cs.LG

    REST: Retrieval-Based Speculative Decoding

    Authors: Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D. Lee, Di He

    Abstract: We introduce Retrieval-Based Speculative Decoding (REST), a novel algorithm designed to speed up language model generation. The key insight driving the development of REST is the observation that the process of text generation often includes certain common phases and patterns. Unlike previous methods that rely on a draft language model for speculative decoding, REST harnesses the power of retrieva…

    Submitted 4 April, 2024; v1 submitted 14 November, 2023; originally announced November 2023.

    Comments: NAACL 2024, camera ready

  24. arXiv:2307.13586  [pdf, ps, other]

    cs.LG

    Settling the Sample Complexity of Online Reinforcement Learning

    Authors: Zihan Zhang, Yuxin Chen, Jason D. Lee, Simon S. Du

    Abstract: A central issue lying at the heart of online reinforcement learning (RL) is data efficiency. While a number of recent works achieved asymptotically minimal regret in online RL, the optimality of these results is only guaranteed in a "large-sample" regime, imposing enormous burn-in cost in order for their algorithms to operate optimally. How to achieve minimax-optimal regret without incurring any…

    Submitted 23 May, 2024; v1 submitted 25 July, 2023; originally announced July 2023.

  25. arXiv:2307.03381  [pdf, other]

    cs.LG

    Teaching Arithmetic to Small Transformers

    Authors: Nayoung Lee, Kartik Sreenivasan, Jason D. Lee, Kangwook Lee, Dimitris Papailiopoulos

    Abstract: Large language models like GPT-4 exhibit emergent capabilities across general-purpose tasks, such as basic arithmetic, when trained on extensive text data, even though these tasks are not explicitly encoded by the unsupervised, next-token prediction objective. This study investigates how small transformers, trained from random initialization, can efficiently learn arithmetic operations such as add…

    Submitted 7 July, 2023; originally announced July 2023.

  26. arXiv:2307.02690  [pdf, other]

    cs.CL cs.AI cs.LG

    Scaling In-Context Demonstrations with Structured Attention

    Authors: Tianle Cai, Kaixuan Huang, Jason D. Lee, Mengdi Wang

    Abstract: The recent surge of large language models (LLMs) highlights their ability to perform in-context learning, i.e., "learning" to perform a task from a few demonstrations in the context without any parameter updates. However, their capabilities of in-context learning are limited by the model architecture: 1) the use of demonstrations is constrained by a maximum sentence length due to positional embedd…

    Submitted 5 July, 2023; originally announced July 2023.

  27. arXiv:2306.12383  [pdf, ps, other]

    cs.LG stat.ML

    Sample Complexity for Quadratic Bandits: Hessian Dependent Bounds and Optimal Algorithms

    Authors: Qian Yu, Yining Wang, Baihe Huang, Qi Lei, Jason D. Lee

    Abstract: In stochastic zeroth-order optimization, a problem of practical relevance is understanding how to fully exploit the local geometry of the underlying objective function. We consider a fundamental setting in which the objective function is quadratic, and provide the first tight characterization of the optimal Hessian-dependent sample complexity. Our contribution is twofold. First, from an informatio…

    Submitted 25 December, 2023; v1 submitted 21 June, 2023; originally announced June 2023.

  28. arXiv:2305.18505  [pdf, ps, other]

    cs.LG cs.AI math.ST stat.ML

    Provable Reward-Agnostic Preference-Based Reinforcement Learning

    Authors: Wenhao Zhan, Masatoshi Uehara, Wen Sun, Jason D. Lee

    Abstract: Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories, rather than explicit reward signals. While PbRL has demonstrated practical success in fine-tuning language models, existing theoretical work focuses on regret minimization and fails to capture most of the practical frameworks. In t…

    Submitted 17 April, 2024; v1 submitted 29 May, 2023; originally announced May 2023.

    Comments: ICLR 2024 Spotlight

  29. arXiv:2305.17608  [pdf, other]

    cs.LG cs.AI cs.CL math.OC stat.ML

    Reward Collapse in Aligning Large Language Models

    Authors: Ziang Song, Tianle Cai, Jason D. Lee, Weijie J. Su

    Abstract: The extraordinary capabilities of large language models (LLMs) such as ChatGPT and GPT-4 are in part unleashed by aligning them with reward models that are trained on human preferences, which are often represented as rankings of responses to prompts. In this paper, we document the phenomenon of reward collapse, an empirical observation where the prevailing ranking-based approach results i…

    Submitted 27 May, 2023; originally announced May 2023.

  30. arXiv:2305.17333  [pdf, other]

    cs.LG cs.CL

    Fine-Tuning Language Models with Just Forward Passes

    Authors: Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, Sanjeev Arora

    Abstract: Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zeroth-order opti…

    Submitted 11 January, 2024; v1 submitted 26 May, 2023; originally announced May 2023.

    Comments: Accepted by NeurIPS 2023 (oral). Code available at https://github.com/princeton-nlp/MeZO

  31. arXiv:2305.14816  [pdf, ps, other]

    cs.LG math.ST stat.ML

    Provable Offline Preference-Based Reinforcement Learning

    Authors: Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D. Lee, Wen Sun

    Abstract: In this paper, we investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback where feedback is available in the form of preference between trajectory pairs rather than explicit rewards. Our proposed algorithm consists of two main steps: (1) estimate the implicit reward using Maximum Likelihood Estimation (MLE) with general function approximation from offl…

    Submitted 29 September, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: The first two authors contribute equally

  32. arXiv:2305.11788  [pdf, other]

    cs.LG stat.ML

    Implicit Bias of Gradient Descent for Logistic Regression at the Edge of Stability

    Authors: Jingfeng Wu, Vladimir Braverman, Jason D. Lee

    Abstract: Recent research has observed that in machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS) [Cohen, et al., 2021], where the stepsizes are set to be large, resulting in non-monotonic losses induced by the GD iterates. This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS…

    Submitted 15 October, 2023; v1 submitted 19 May, 2023; originally announced May 2023.

    Comments: NeurIPS 2023 camera ready version

  33. arXiv:2305.10633  [pdf, other]

    cs.LG cs.IT stat.ML

    Smoothing the Landscape Boosts the Signal for SGD: Optimal Sample Complexity for Learning Single Index Models

    Authors: Alex Damian, Eshaan Nichani, Rong Ge, Jason D. Lee

    Abstract: We focus on the task of learning a single index model $σ(w^\star \cdot x)$ with respect to the isotropic Gaussian distribution in $d$ dimensions. Prior work has shown that the sample complexity of learning $w^\star$ is governed by the information exponent $k^\star$ of the link function $σ$, which is defined as the index of the first nonzero Hermite coefficient of $σ$. Ben Arous et al. (2021) showe…

    Submitted 17 May, 2023; originally announced May 2023.

  34. arXiv:2305.10282  [pdf, ps, other]

    cs.LG cs.IT math.ST stat.ML

    Reward-agnostic Fine-tuning: Provable Statistical Benefits of Hybrid Reinforcement Learning

    Authors: Gen Li, Wenhao Zhan, Jason D. Lee, Yuejie Chi, Yuxin Chen

    Abstract: This paper studies tabular reinforcement learning (RL) in the hybrid setting, which assumes access to both an offline dataset and online interactions with the unknown environment. A central question boils down to how to efficiently utilize online data collection to strengthen and complement the offline dataset and enable effective policy fine-tuning. Leveraging recent advances in reward-agnostic e…

    Submitted 17 May, 2023; originally announced May 2023.

  35. arXiv:2305.06986  [pdf, other]

    cs.LG stat.ML

    Provable Guarantees for Nonlinear Feature Learning in Three-Layer Neural Networks

    Authors: Eshaan Nichani, Alex Damian, Jason D. Lee

    Abstract: One of the central questions in the theory of deep learning is to understand how neural networks learn hierarchical features. The ability of deep networks to extract salient features is crucial to both their outstanding generalization ability and the modern deep learning paradigm of pretraining and finetuning. However, this feature learning process remains poorly understood from a theoretical per…

    Submitted 31 October, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

    Comments: v2: NeurIPS 2023 camera ready

  36. arXiv:2305.04819  [pdf, other]

    cs.LG cs.GT cs.MA stat.ML

    Local Optimization Achieves Global Optimality in Multi-Agent Reinforcement Learning

    Authors: Yulai Zhao, Zhuoran Yang, Zhaoran Wang, Jason D. Lee

    Abstract: Policy optimization methods with function approximation are widely used in multi-agent reinforcement learning. However, it remains elusive how to design such algorithms with statistical guarantees. Leveraging a multi-agent performance difference lemma that characterizes the landscape of multi-agent policy optimization, we find that the localized action value function serves as an ideal descent dir…

    Submitted 8 May, 2023; originally announced May 2023.

    Comments: ICML 2023

  37. arXiv:2303.03095  [pdf, other]

    cs.GT cs.LG math.OC

    Can We Find Nash Equilibria at a Linear Rate in Markov Games?

    Authors: Zhuoqing Song, Jason D. Lee, Zhuoran Yang

    Abstract: We study decentralized learning in two-player zero-sum discounted Markov games where the goal is to design a policy optimization algorithm for either agent satisfying two properties. First, the player does not need to know the policy of the opponent to update its policy. Second, when both players adopt the algorithm, their joint policy converges to a Nash equilibrium of the game. To this end, we c…

    Submitted 2 March, 2023; originally announced March 2023.

    Comments: ICLR 2023

  38. arXiv:2302.11634  [pdf, ps, other]

    cs.LG

    Provably Efficient Reinforcement Learning via Surprise Bound

    Authors: Hanlin Zhu, Ruosong Wang, Jason D. Lee

    Abstract: Value function approximation is important in modern reinforcement learning (RL) problems especially when the state space is (infinitely) large. Despite the importance and wide applicability of value function approximation, its theoretical understanding is still not as sophisticated as its empirical success, especially in the context of general function approximation. In this paper, we propose a pr…

    Submitted 22 February, 2023; originally announced February 2023.

    Comments: 35 pages, AISTATS 2023

  39. arXiv:2302.04753  [pdf, other]

    cs.LG stat.ML

    Efficient displacement convex optimization with particle gradient descent

    Authors: Hadi Daneshmand, Jason D. Lee, Chi Jin

    Abstract: Particle gradient descent, which uses particles to represent a probability measure and performs gradient descent on particles in parallel, is widely used to optimize functions of probability measures. This paper considers particle gradient descent with a finite number of particles and establishes its theoretical guarantees to optimize functions that are displacement convex in measures. Conc…

    Submitted 9 February, 2023; originally announced February 2023.

  40. arXiv:2302.02392  [pdf, ps, other]

    cs.LG stat.ML

    Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage

    Authors: Masatoshi Uehara, Nathan Kallus, Jason D. Lee, Wen Sun

    Abstract: In offline reinforcement learning (RL) we have no opportunity to explore so we must make assumptions that the data is sufficient to guide picking a good policy, taking the form of assuming some coverage, realizability, Bellman completeness, and/or hard margin (gap). In this work we propose value-based algorithms for offline RL with PAC guarantees under just partial coverage, specifically, coverage…

    Submitted 13 November, 2023; v1 submitted 5 February, 2023; originally announced February 2023.

    Comments: The original title of this paper was "Refined Value-Based Offline RL under Realizability and Partial Coverage," but it was later changed. This paper has been accepted for NeurIPS 2023

  41. arXiv:2301.13196  [pdf, other]

    cs.LG cs.AI

    Looped Transformers as Programmable Computers

    Authors: Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D. Lee, Dimitris Papailiopoulos

    Abstract: We present a framework for using transformer networks as universal computers by programming them with specific weights and placing them in a loop. Our input sequence acts as a punchcard, consisting of instructions and memory for data read/writes. We demonstrate that a constant number of encoder layers can emulate basic computing blocks, including embedding edit operations, non-linear functions, fu…

    Submitted 30 January, 2023; originally announced January 2023.

  42. arXiv:2301.11500  [pdf, other]

    cs.LG math.OC stat.ML

    Understanding Incremental Learning of Gradient Descent: A Fine-grained Analysis of Matrix Sensing

    Authors: Jikai Jin, Zhiyuan Li, Kaifeng Lyu, Simon S. Du, Jason D. Lee

    Abstract: It is believed that Gradient Descent (GD) induces an implicit bias towards good generalization in training machine learning models. This paper provides a fine-grained analysis of the dynamics of GD for the matrix sensing problem, whose goal is to recover a low-rank ground-truth matrix from near-isotropic linear measurements. It is shown that GD with small initialization behaves similarly to the gr…

    Submitted 26 January, 2023; originally announced January 2023.

  43. arXiv:2212.03714  [pdf, other]

    cs.LG cs.CR stat.ML

    Reconstructing Training Data from Model Gradient, Provably

    Authors: Zihan Wang, Jason D. Lee, Qi Lei

    Abstract: Understanding when and how much a model gradient leaks information about the training sample is an important question in privacy. In this paper, we present a surprising result: even without training or memorizing the data, we can fully reconstruct the training samples from a single gradient query at a randomly chosen parameter value. We prove the identifiability of the training data under mild con…

    Submitted 10 June, 2023; v1 submitted 7 December, 2022; originally announced December 2022.

  44. arXiv:2210.06705  [pdf, ps, other]

    cs.LG cs.AI math.OC

    From Gradient Flow on Population Loss to Learning with Stochastic Gradient Descent

    Authors: Satyen Kale, Jason D. Lee, Chris De Sa, Ayush Sekhari, Karthik Sridharan

    Abstract: Stochastic Gradient Descent (SGD) has been the method of choice for learning large-scale non-convex models. While a general analysis of when SGD works has been elusive, there has been a lot of recent progress in understanding the convergence of Gradient Flow (GF) on the population loss, partly due to the simplicity that a continuous-time analysis buys us. An overarching theme of our paper is provi…

    Submitted 12 October, 2022; originally announced October 2022.

  45. arXiv:2209.15594  [pdf, other]

    cs.LG cs.IT math.OC stat.ML

    Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability

    Authors: Alex Damian, Eshaan Nichani, Jason D. Lee

    Abstract: Traditional analyses of gradient descent show that when the largest eigenvalue of the Hessian, also known as the sharpness $S(θ)$, is bounded by $2/η$, training is "stable" and the training loss decreases monotonically. Recent works, however, have observed that this assumption does not hold when training modern neural networks with full batch or large batch gradient descent. Most recently, Cohen e…

    Submitted 10 April, 2023; v1 submitted 30 September, 2022; originally announced September 2022.

    Comments: ICLR 2023, first two authors contributed equally

  46. arXiv:2207.05738  [pdf, other]

    cs.LG

    PAC Reinforcement Learning for Predictive State Representations

    Authors: Wenhao Zhan, Masatoshi Uehara, Wen Sun, Jason D. Lee

    Abstract: In this paper we study online Reinforcement Learning (RL) in partially observable dynamical systems. We focus on the Predictive State Representations (PSRs) model, which is an expressive model that captures other well-known models such as Partially Observable Markov Decision Processes (POMDP). PSR represents the states using a set of predictions of future observations and is defined entirely using…

    Submitted 13 August, 2022; v1 submitted 12 July, 2022; originally announced July 2022.

  47. arXiv:2206.15144  [pdf, other]

    cs.LG cs.IT stat.ML

    Neural Networks can Learn Representations with Gradient Descent

    Authors: Alex Damian, Jason D. Lee, Mahdi Soltanolkotabi

    Abstract: Significant theoretical work has established that in specific regimes, neural networks trained by gradient descent behave like kernel methods. However, in practice, it is known that neural networks strongly outperform their associated kernels. In this work, we explain this gap by demonstrating that there is a large class of functions which cannot be efficiently learned by kernel methods but can be…

    Submitted 30 June, 2022; originally announced June 2022.

    Comments: COLT 2022

  48. arXiv:2206.12081  [pdf, other]

    cs.LG stat.ME stat.ML

    Computationally Efficient PAC RL in POMDPs with Latent Determinism and Conditional Embeddings

    Authors: Masatoshi Uehara, Ayush Sekhari, Jason D. Lee, Nathan Kallus, Wen Sun

    Abstract: We study reinforcement learning with function approximation for large-scale Partially Observable Markov Decision Processes (POMDPs) where the state space and observation space are large or even continuous. Particularly, we consider Hilbert space embeddings of POMDP where the feature of latent states and the feature of observations admit a conditional Hilbert space embedding of the observation emis…

    Submitted 24 June, 2022; originally announced June 2022.

  49. arXiv:2206.12020  [pdf, ps, other]

    cs.LG math.ST stat.ME stat.ML

    Provably Efficient Reinforcement Learning in Partially Observable Dynamical Systems

    Authors: Masatoshi Uehara, Ayush Sekhari, Jason D. Lee, Nathan Kallus, Wen Sun

    Abstract: We study Reinforcement Learning for partially observable dynamical systems using function approximation. We propose a new Partially Observable Bilinear Actor-Critic framework, that is general enough to include models such as observable tabular Partially Observable Markov Decision Processes (POMDPs), observable Linear-Quadratic-Gaussian (LQG), Predictive State Representations (PSRs), as we…

    Submitted 23 June, 2022; originally announced June 2022.

  50. arXiv:2206.03688  [pdf, other]

    cs.LG stat.ML

    Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials

    Authors: Eshaan Nichani, Yu Bai, Jason D. Lee

    Abstract: A recent goal in the theory of deep learning is to identify how neural networks can escape the "lazy training," or Neural Tangent Kernel (NTK) regime, where the network is coupled with its first order Taylor expansion at initialization. While the NTK is minimax optimal for learning dense polynomials (Ghorbani et al, 2021), it cannot learn features, and hence has poor sample complexity for learning…

    Submitted 26 November, 2022; v1 submitted 8 June, 2022; originally announced June 2022.

    Comments: v2: NeurIPS 2022 camera ready version