A Novel Catalyst Scheme for Stochastic Minimax Optimization

Abstract: This paper presents a proximal-point-based catalyst scheme for simple first-order methods applied to convex minimization and convex-concave minimax problems. In particular, for smooth and (strongly)-convex minimization problems, the proposed catalyst scheme, instantiated with a simple variant of stochastic gradient method, attains the optimal rate of convergence in terms of both deterministic and stochastic errors. For smooth and strongly-convex-strongly-concave minimax problems, the catalyst scheme attains the optimal rate of convergence for deterministic and stochastic errors up to a logarithmic factor. To the best of our knowledge, this reported convergence seems to be attained for the first time by stochastic first-order methods in the literature. We obtain this result by designing and catalyzing a novel variant of stochastic extragradient method for solving smooth and strongly-monotone variational inequality, which may be of independent interest.

Submitted 7 November, 2023; v1 submitted 5 November, 2023; originally announced November 2023.

arXiv:2310.19807 [pdf, other]

Improved Communication Efficiency in Federated Natural Policy Gradient via ADMM-based Gradient Updates

Authors: Guangchen Lan, Han Wang, James Anderson, Christopher Brinton, Vaneet Aggarwal

Abstract: Federated reinforcement learning (FedRL) enables agents to collaboratively train a global policy without sharing their individual data. However, high communication overhead remains a critical bottleneck, particularly for natural policy gradient (NPG) methods, which are second-order. To address this issue, we propose the FedNPG-ADMM framework, which leverages the alternating direction method of multipliers (ADMM) to approximate global NPG directions efficiently. We theoretically demonstrate that using ADMM-based gradient updates reduces communication complexity from ${O}({d^{2}})$ to ${O}({d})$ at each iteration, where $d$ is the number of model parameters. Furthermore, we show that achieving an $ε$-error stationary convergence requires ${O}(\frac{1}{(1-γ)^{2}ε})$ iterations for discount factor $γ$, demonstrating that FedNPG-ADMM maintains the same convergence rate as the standard FedNPG. Through evaluation of the proposed algorithms in MuJoCo environments, we demonstrate that FedNPG-ADMM maintains the reward performance of standard FedNPG, and that its convergence rate improves when the number of federated agents increases.

Submitted 9 October, 2023; originally announced October 2023.

Comments: Accepted at the 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

ACM Class: I.2.6

arXiv:2310.12139 [pdf, ps, other]

Optimal and parameter-free gradient minimization methods for convex and nonconvex optimization

Authors: Guanghui Lan, Yuyuan Ouyang, Zhe Zhang

Abstract: We propose novel optimal and parameter-free algorithms for computing an approximate solution with small (projected) gradient norm. Specifically, for computing an approximate solution such that the norm of its (projected) gradient does not exceed $\varepsilon$, we obtain the following results: a) for the convex case, the total number of gradient evaluations is bounded by $O(1)\sqrt{L\|x_0 - x^*\|/\varepsilon}$, where $L$ is the Lipschitz constant of the gradient and $x^*$ is any optimal solution; b) for the strongly convex case, the total number of gradient evaluations is bounded by $O(1)\sqrt{L/μ}\log(\|\nabla f(x_0)\|/ε)$, where $μ$ is the strong convexity modulus; and c) for the nonconvex case, the total number of gradient evaluations is bounded by $O(1)\sqrt{Ll}(f(x_0) - f(x^*))/\varepsilon^2$, where $l$ is the lower curvature constant. Our complexity results match the lower complexity bounds of the convex and strongly convex cases, and achieve the above best-known complexity bound for the nonconvex case for the first time in the literature. Moreover, for all the convex, strongly convex, and nonconvex cases, we propose parameter-free algorithms that do not require the input of any problem parameters. To the best of our knowledge, there do not exist such parameter-free methods before especially for the strongly convex and nonconvex cases. Since most regularity conditions (e.g., strong convexity and lower curvature) are imposed over a global scope, the corresponding problem parameters are notoriously difficult to estimate. However, gradient norm minimization equips us with a convenient tool to monitor the progress of algorithms and thus the ability to estimate such parameters in-situ.

Submitted 29 November, 2023; v1 submitted 18 October, 2023; originally announced October 2023.

arXiv:2310.10082 [pdf, other]

A simple uniformly optimal method without line search for convex optimization

Authors: Tianjiao Li, Guanghui Lan

Abstract: Line search (or backtracking) procedures have been widely employed into first-order methods for solving convex optimization problems, especially those with unknown problem parameters (e.g., Lipschitz constant). In this paper, we show that line search is superfluous in attaining the optimal rate of convergence for solving a convex optimization problem whose parameters are not given a priori. In particular, we present a novel accelerated gradient descent type algorithm called auto-conditioned fast gradient method (AC-FGM) that can achieve an optimal $\mathcal{O}(1/k^2)$ rate of convergence for smooth convex optimization without requiring the estimate of a global Lipschitz constant or the employment of line search procedures. We then extend AC-FGM to solve convex optimization problems with Hölder continuous gradients and show that it automatically achieves the optimal rates of convergence uniformly for all problem classes with the desired accuracy of the solution as the only input. Finally, we report some encouraging numerical results that demonstrate the advantages of AC-FGM over the previously developed parameter-free methods for convex optimization.

Submitted 26 October, 2023; v1 submitted 16 October, 2023; originally announced October 2023.

arXiv:2307.15890 [pdf, ps, other]

First-order Policy Optimization for Robust Policy Evaluation

Authors: Yan Li, Guanghui Lan

Abstract: We adopt a policy optimization viewpoint towards policy evaluation for robust Markov decision process with $\mathrm{s}$-rectangular ambiguity sets. The developed method, named first-order policy evaluation (FRPE), provides the first unified framework for robust policy evaluation in both deterministic (offline) and stochastic (online) settings, with either tabular representation or generic function approximation. In particular, we establish linear convergence in the deterministic setting, and $\tilde{\mathcal{O}}(1/ε^2)$ sample complexity in the stochastic setting. FRPE also extends naturally to evaluating the robust state-action value function with $(\mathrm{s}, \mathrm{a})$-rectangular ambiguity sets. We discuss the application of the developed results for stochastic policy optimization of large-scale robust MDPs.

Submitted 29 July, 2023; originally announced July 2023.

arXiv:2307.01497 [pdf, other]

Accelerated stochastic approximation with state-dependent noise

Authors: Sasila Ilandarideva, Anatoli Juditsky, Guanghui Lan, Tianjiao Li

Abstract: We consider a class of stochastic smooth convex optimization problems under rather general assumptions on the noise in the stochastic gradient observation. As opposed to the classical problem setting in which the variance of noise is assumed to be uniformly bounded, herein we assume that the variance of stochastic gradients is related to the "sub-optimality" of the approximate solutions delivered by the algorithm. Such problems naturally arise in a variety of applications, in particular, in the well-known generalized linear regression problem in statistics. However, to the best of our knowledge, none of the existing stochastic approximation algorithms for solving this class of problems attain optimality in terms of the dependence on accuracy, problem parameters, and mini-batch size. We discuss two non-Euclidean accelerated stochastic approximation routines--stochastic accelerated gradient descent (SAGD) and stochastic gradient extrapolation (SGE)--which carry a particular duality relationship. We show that both SAGD and SGE, under appropriate conditions, achieve the optimal convergence rate, attaining the optimal iteration and sample complexities simultaneously. However, corresponding assumptions for the SGE algorithm are more general; they allow, for instance, for efficient application of the SGE to statistical estimation problems under heavy tail noises and discontinuous score functions. We also discuss the application of the SGE to problems satisfying quadratic growth conditions, and show how it can be used to recover sparse solutions. Finally, we report on some simulation experiments to illustrate numerical performance of our proposed algorithms in high-dimensional settings.

Submitted 13 July, 2023; v1 submitted 4 July, 2023; originally announced July 2023.

arXiv:2306.12116 [pdf, ps, other]

Mean square exponential stability of numerical methods for stochastic differential delay equations

Authors: Guangqiang Lan, Qi Liu

Abstract: Mean square exponential stability of $θ$-EM and modified truncated Euler-Maruyama (MTEM) methods for stochastic differential delay equations (SDDEs) are investigated in this paper. We present new criterion of mean square exponential stability of the $θ$-EM and MTEM methods for SDDEs, which are different from most existing results under Khasminskii-type conditions. Two examples are provided to support our conclusions.

Submitted 21 June, 2023; originally announced June 2023.

Comments: 19 pages

MSC Class: 65C30; 65C20; 65L05; 65L20

arXiv:2303.15672 [pdf, ps, other]

Numerical Methods for Convex Multistage Stochastic Optimization

Authors: Guanghui Lan, Alexander Shapiro

Abstract: Optimization problems involving sequential decisions in a stochastic environment were studied in Stochastic Programming (SP), Stochastic Optimal Control (SOC) and Markov Decision Processes (MDP). In this paper we mainly concentrate on SP and SOC modelling approaches. In these frameworks there are natural situations when the considered problems are convex. Classical approach to sequential optimization is based on dynamic programming. It has the problem of the so-called "Curse of Dimensionality", in that its computational complexity increases exponentially with increase of dimension of state variables. Recent progress in solving convex multistage stochastic problems is based on cutting planes approximations of the cost-to-go (value) functions of dynamic programming equations. Cutting planes type algorithms in dynamical settings is one of the main topics of this paper. We also discuss Stochastic Approximation type methods applied to multistage stochastic optimization problems. From the computational complexity point of view, these two types of methods seem to be complementary to each other. Cutting plane type methods can handle multistage problems with a large number of stages, but a relatively smaller number of state (decision) variables. On the other hand, stochastic approximation type methods can only deal with a small number of stages, but a large number of decision variables.

Submitted 27 March, 2023; originally announced March 2023.

MSC Class: 65K05; 90C15; 90C39; 90C40

arXiv:2303.04386 [pdf, ps, other]

Policy Mirror Descent Inherently Explores Action Space

Authors: Yan Li, Guanghui Lan

Abstract: Explicit exploration in the action space was assumed to be indispensable for online policy gradient methods to avoid a drastic degradation in sample complexity, for solving general reinforcement learning problems over finite state and action spaces. In this paper, we establish for the first time an $\tilde{\mathcal{O}}(1/ε^2)$ sample complexity for online policy gradient methods without incorporating any exploration strategies. The essential development consists of two new on-policy evaluation operators and a novel analysis of the stochastic policy mirror descent method (SPMD). SPMD with the first evaluation operator, called value-based estimation, tailors to the Kullback-Leibler divergence. Provided the Markov chains on the state space of generated policies are uniformly mixing with non-diminishing minimal visitation measure, an $\tilde{\mathcal{O}}(1/ε^2)$ sample complexity is obtained with a linear dependence on the size of the action space. SPMD with the second evaluation operator, namely truncated on-policy Monte Carlo (TOMC), attains an $\tilde{\mathcal{O}}(\mathcal{H}_{\mathcal{D}}/ε^2)$ sample complexity, where $\mathcal{H}_{\mathcal{D}}$ mildly depends on the effective horizon and the size of the action space with properly chosen Bregman divergence (e.g., Tsallis divergence). SPMD with TOMC also exhibits stronger convergence properties in that it controls the optimality gap with high probability rather than in expectation. In contrast to explicit exploration, these new policy gradient methods can prevent repeatedly committing to potentially high-risk actions when searching for optimal policies.

Submitted 20 March, 2023; v1 submitted 8 March, 2023; originally announced March 2023.

arXiv:2303.02024 [pdf, other]

Dual dynamic programming for stochastic programs over an infinite horizon

Authors: Caleb Ju, Guanghui Lan

Abstract: We consider a dual dynamic programming algorithm for solving stochastic programs over an infinite horizon. We show non-asymptotic convergence results when using an explorative strategy, and we then enhance this result by reducing the dependence of the effective planning horizon from quadratic to linear. This improvement is achieved by combining the forward and backward phases from dual dynamic programming into a single iteration. We then apply our algorithms to a class of problems called hierarchical stationary stochastic programs, where the cost function is a stochastic multi-stage program. The hierarchical program can model problems with a hierarchy of decision-making, e.g., how long-term decisions influence day-to-day operations. We show that when the subproblems are solved inexactly via a dynamic stochastic approximation-type method, the resulting hierarchical dual dynamic programming can find approximately optimal solutions in finite time. Preliminary numerical results show the practical benefits of using the explorative strategy for solving the Brazilian hydro-thermal planning problem and economic dispatch, as well as the potential to exploit parallel computing.

Submitted 4 April, 2023; v1 submitted 3 March, 2023; originally announced March 2023.

Comments: 45 pages. New experiments for hierarchical problem and writing updates

MSC Class: 90C25; 49M37; 90C06; 90C15; 93E20

arXiv:2212.00084 [pdf, other]

A model-free first-order method for linear quadratic regulator with $\tilde{O}(1/\varepsilon)$ sampling complexity

Authors: Caleb Ju, Georgios Kotsalis, Guanghui Lan

Abstract: We consider the classic stochastic linear quadratic regulator (LQR) problem under an infinite horizon average stage cost. By leveraging recent policy gradient methods from reinforcement learning, we obtain a first-order method that finds a stable feedback law whose objective function gap to the optima is at most $\varepsilon$ with high probability using $\tilde{O}(1/\varepsilon)$ samples, where $\tilde{O}$ hides polylogarithmic dependence on $\varepsilon$. Our proposed method seems to have the best dependence on $\varepsilon$ within the model-free literature without the assumption that all policies generated by the algorithm are stable almost surely, and it matches the best-known rate from the model-based literature, up to logarithmic factors. The improved dependence on $\varepsilon$ is achieved by showing the accuracy scales with the variance rather than the standard deviation of the gradient estimation error. Our developments that result in this improved sampling complexity fall in the category of actor-critic algorithms. The actor part involves a variational inequality formulation of the stochastic LQR problem, while in the critic part, we utilize a conditional stochastic primal-dual method and show that the algorithm has the optimal rate of convergence when paired with a shrinking multi-epoch scheme.

Submitted 10 May, 2023; v1 submitted 30 November, 2022; originally announced December 2022.

Comments: Pre-print. 23 pages, 1 figure. Update fixes some parts of proof that had incorrect constants and addresses stability of policy. Comments are welcome

MSC Class: 93C05; 65K05

arXiv:2211.16715 [pdf, ps, other]

Policy Optimization over General State and Action Spaces

Authors: Guanghui Lan

Abstract: Reinforcement learning (RL) problems over general state and action spaces are notoriously challenging. In contrast to the tableau setting, one can not enumerate all the states and then iteratively update the policies for each state. This prevents the application of many well-studied RL methods especially those with provable convergence guarantees. In this paper, we first present a substantial generalization of the recently developed policy mirror descent method to deal with general state and action spaces. We introduce new approaches to incorporate function approximation into this method, so that we do not need to use explicit policy parameterization at all. Moreover, we present a novel policy dual averaging method for which possibly simpler function approximation techniques can be applied. We establish linear convergence rate to global optimality or sublinear convergence to stationarity for these methods applied to solve different classes of RL problems under exact policy evaluation. We then define proper notions of the approximation errors for policy evaluation and investigate their impact on the convergence of these methods applied to general-state RL problems with either finite-action or continuous-action spaces. To the best of our knowledge, the development of these algorithmic frameworks as well as their convergence analysis appear to be new in the literature.

Submitted 9 May, 2023; v1 submitted 29 November, 2022; originally announced November 2022.

arXiv:2210.05807 [pdf, ps, other]

Solving Convex Smooth Function Constrained Optimization Is Almost As Easy As Unconstrained Optimization

Authors: Zhe Zhang, Guanghui Lan

Abstract: Consider applying first-order methods to solve the smooth convex constrained optimization problem of the form $\min_{x \in X} F(x).$ For a simple closed convex set $X$ which is easy to project onto, Nesterov proposed the Accelerated Gradient Descent (AGD) method to solve the constrained problem as efficiently as an unconstrained problem in terms of the number of gradient computations of $F$ (i.e., oracle complexity). For a more complicated $\mathcal{X}$ described by function constraints, i.e., $\mathcal{X} = \{x \in X: g(x) \leq 0\}$, where the projection onto $\mathcal{X}$ is not possible, it is an open question whether the function constrained problem can be solved as efficiently as an unconstrained problem in terms of the number of gradient computations for $F$ and $g$. In this paper, we provide an affirmative answer to the question by proposing a single-loop Accelerated Constrained Gradient Descent (ACGD) method. The ACGD method modifies the AGD method by changing the descent step to a constrained descent step, which adds only a few linear constraints to the prox mapping. It enjoys almost the same oracle complexity as the optimal one for minimizing the optimal Lagrangian function, i.e., the Lagrangian multiplier $λ$ being fixed to the optimal multiplier $λ^*$. These upper oracle complexity bounds are shown to be unimprovable under a certain optimality regime with new lower oracle complexity bounds. To enhance its efficiency for large-scale problems with many function constraints, we introduce an ACGD with Sliding (ACGD-S) method which replaces the possibly computationally demanding constrained descent step with a sequence of basic matrix-vector multiplications. The ACGD-S method shares the same oracle complexity as the ACGD method, and its computation complexity, measured by the number of matrix-vector multiplications, is also unimprovable.

Submitted 2 November, 2022; v1 submitted 11 October, 2022; originally announced October 2022.

arXiv:2210.05108 [pdf, ps, other]

Functional Constrained Optimization for Risk Aversion and Sparsity Control

Authors: Yi Cheng, Guanghui Lan, H. Edwin Romeijn

Abstract: Risk and sparsity requirements often need to be enforced simultaneously in many applications, e.g., in portfolio optimization, assortment planning, and treatment planning. Properly balancing these potentially conflicting requirements entails the formulation of functional constrained optimization with either convex or nonconvex objectives. In this paper, we focus on projection-free methods that can generate a sparse trajectory for solving these challenging functional constrained optimization problems. Specifically, for the convex setting, we propose a Level Conditional Gradient (LCG) method, which leverages a level-set framework to update the approximation of the optimal value and an inner conditional gradient oracle (CGO) for solving mini-max subproblems. We show that the method achieves $\mathcal{O}\big(\frac{1}{ε^2}\log\frac{1}ε\big)$ iteration complexity for solving both smooth and nonsmooth cases without dependency on a possibly large size of optimal dual Lagrange multiplier. For the nonconvex setting, we introduce the Level Inexact Proximal Point (IPP-LCG) method and the Direct Nonconvex Conditional Gradient (DNCG) method. The first approach taps into the advantage of LCG by transforming the problem into a series of convex subproblems and exhibits an $\mathcal{O}\big(\frac{1}{ε^3}\log\frac{1}ε\big)$ iteration complexity for finding an ($ε,ε$)-KKT point. The DNCG is the first single-loop projection-free method, with iteration complexity bounded by $\mathcal{O}\big(1/ε^4\big)$ for computing a so-called $ε$-Wolfe point. We demonstrate the effectiveness of LCG, IPP-LCG and DNCG by devising formulations and conducting numerical experiments on two risk averse sparse optimization applications: a portfolio selection problem with and without cardinality requirement, and a radiation therapy planning problem in healthcare.

Submitted 10 October, 2022; originally announced October 2022.

arXiv:2209.12111 [pdf, other]

Convergence and exponential stability of modified truncated Milstein method for stochastic differential equations

Authors: Yu Jiang, Guangqiang Lan

Abstract: In this paper, we develop a new explicit scheme called modified truncated Milstein method which is motivated by truncated Milstein method proposed by Guo (2018) and modified truncated Euler-Maruyama method introduced by Lan (2018). We obtain the strong convergence of the scheme under local boundedness and Khasminskii-type conditions, which are relatively weaker than the existing results, and we prove that the convergence rate could be arbitrarily close to 1 under given conditions. Moreover, exponential stability of the scheme is also considered while it is impossible for truncated Milstein method introduced in Guo (2018). Three numerical experiments are offered to support our conclusions.

Submitted 24 September, 2022; originally announced September 2022.

Comments: 28 pages, 6 figures

MSC Class: 65C30; 65C20; 65L05; 65L20

arXiv:2209.10579 [pdf, other]

First-order Policy Optimization for Robust Markov Decision Process

Authors: Yan Li, Guanghui Lan, Tuo Zhao

Abstract: We consider the problem of solving robust Markov decision process (MDP), which involves a set of discounted, finite state, finite action space MDPs with uncertain transition kernels. The goal of planning is to find a robust policy that optimizes the worst-case values against the transition uncertainties, and thus encompasses the standard MDP planning as a special case. For $(\mathbf{s},\mathbf{a})$-rectangular uncertainty sets, we establish several structural observations on the robust objective, which facilitates the development of a policy-based first-order method, namely the robust policy mirror descent (RPMD). An $\mathcal{O}(\log(1/ε))$ iteration complexity for finding an $ε$-optimal policy is established with linearly increasing stepsizes. We further develop a stochastic variant of the robust policy mirror descent method, named SRPMD, when the first-order information is only available through online interactions with the nominal environment. We show that the optimality gap converges linearly up to the noise level, and consequently establish an $\tilde{\mathcal{O}}(1/ε^2)$ sample complexity by developing a temporal difference learning method for policy evaluation. Both iteration and sample complexities are also discussed for RPMD with a constant stepsize. To the best of our knowledge, all the aforementioned results appear to be new for policy-based first-order methods applied to the robust MDP problem.

Submitted 10 June, 2023; v1 submitted 21 September, 2022; originally announced September 2022.

arXiv:2205.08011 [pdf, other]

Level Constrained First Order Methods for Function Constrained Optimization

Authors: Digvijay Boob, Qi Deng, Guanghui Lan

Abstract: We present a new feasible proximal gradient method for constrained optimization where both the objective and constraint functions are given by the summation of a smooth, possibly nonconvex function and a convex simple function. The algorithm converts the original problem into a sequence of convex subproblems. Formulating those subproblems requires the evaluation of at most one gradient value of the original objective and constraint functions. Either exact or approximate subproblem solutions can be computed efficiently in many cases. An important feature of the algorithm is the constraint level parameter. By carefully increasing this level for each subproblem, we provide a simple solution to overcome the challenge of bounding the Lagrangian multipliers and show that the algorithm follows a strictly feasible solution path till convergence to the stationary point. We develop a simple, proximal gradient descent type analysis, showing that the complexity bound of this new algorithm is comparable to gradient descent for the unconstrained setting, which is new in the literature. Exploiting this new design and analysis technique, we extend our algorithms to some more challenging constrained optimization problems where 1) the objective is a stochastic or finite-sum function, and 2) structured nonsmooth functions replace smooth components of both objective and constraint functions. Complexity results for these problems also seem to be new in the literature. Finally, our method can also be applied to convex function-constrained problems where we show complexities similar to the proximal gradient method.

Submitted 31 January, 2024; v1 submitted 16 May, 2022; originally announced May 2022.

Comments: Accepted at Mathematical Programming

MSC Class: 90C26; 90C30; 90C06; 90C51; 49M37

arXiv:2205.05800 [pdf, other]

Stochastic first-order methods for average-reward Markov decision processes

Authors: Tianjiao Li, Feiyang Wu, Guanghui Lan

Abstract: We study the problem of average-reward Markov decision processes (AMDPs) and develop novel first-order methods with strong theoretical guarantees for both policy evaluation and optimization. Existing on-policy evaluation methods suffer from sub-optimal convergence rates as well as failure in handling insufficiently random policies, e.g., deterministic policies, for lack of exploration. To remedy these issues, we develop a novel variance-reduced temporal difference (VRTD) method with linear function approximation for randomized policies along with sharp convergence guarantees, and an exploratory variance-reduced temporal difference (EVRTD) method for insufficiently random policies with comparable convergence guarantees. We further establish linear convergence rate on the bias of policy evaluation, which is essential for improving the overall sample complexity of policy optimization. On the other hand, compared with intensive research interest in finite sample analysis of policy gradient methods for discounted MDPs, existing studies on policy gradient methods for AMDPs mostly focus on regret bounds under restrictive assumptions on the underlying Markov processes (see, e.g., Abbasi-Yadkori et al., 2019), and they often lack guarantees on the overall sample complexities. Towards this end, we develop an average-reward variant of the stochastic policy mirror descent (SPMD) (Lan, 2022). We establish the first $\widetilde{\mathcal{O}}(ε^{-2})$ sample complexity for solving AMDPs with policy gradient method under both the generative model (with unichain assumption) and Markovian noise model (with ergodic assumption). This bound can be further improved to $\widetilde{\mathcal{O}}(ε^{-1})$ for solving regularized AMDPs. Our theoretical advantages are corroborated by numerical experiments.

Submitted 14 September, 2022; v1 submitted 11 May, 2022; originally announced May 2022.

arXiv:2203.05117 [pdf, other]

Optimal Methods for Convex Risk Averse Distributed Optimization

Authors: Guanghui Lan, Zhe Zhang

Abstract: This paper studies the communication complexity of convex risk-averse optimization over a network. The problem generalizes the well-studied risk-neutral finite-sum distributed optimization problem and its importance stems from the need to handle risk in an uncertain environment. For algorithms in the literature, there exists a gap in communication complexities for solving risk-averse and risk-neutral problems. We propose two distributed algorithms, namely the distributed risk averse optimization (DRAO) method and the distributed risk averse optimization with sliding (DRAO-S) method, to close the gap. Specifically, the DRAO method achieves the optimal communication complexity by assuming a certain saddle point subproblem can be easily solved in the server node. The DRAO-S method removes the strong assumption by introducing a novel saddle point sliding subroutine which only requires the projection over the ambiguity set $P$. We observe that the number of $P$-projections performed by DRAO-S is optimal. Moreover, we develop matching lower complexity bounds to show the communication complexities of both DRAO and DRAO-S to be unimprovable. Numerical experiments are conducted to demonstrate the encouraging empirical performance of the DRAO-S method.

Submitted 7 March, 2023; v1 submitted 9 March, 2022; originally announced March 2022.

arXiv:2202.07868 [pdf, ps, other]

Data-Driven Minimax Optimization with Expectation Constraints

Authors: Shuoguang Yang, Xudong Li, Guanghui Lan

Abstract: Attention to data-driven optimization approaches, including the well-known stochastic gradient descent method, has grown significantly over recent decades, but data-driven constraints have rarely been studied, because of the computational challenges of projections onto the feasible set defined by these hard constraints. In this paper, we focus on the non-smooth convex-concave stochastic minimax regime and formulate the data-driven constraints as expectation constraints. The minimax expectation constrained problem subsumes a broad class of real-world applications, including two-player zero-sum game and data-driven robust optimization. We propose a class of efficient primal-dual algorithms to tackle the minimax expectation-constrained problem, and show that our algorithms converge at the optimal rate of $\mathcal{O}(\frac{1}{\sqrt{N}})$. We demonstrate the practical efficiency of our algorithms by conducting numerical experiments on large-scale real-world applications.

Submitted 9 October, 2023; v1 submitted 16 February, 2022; originally announced February 2022.

arXiv:2201.09457 [pdf, other]

Homotopic Policy Mirror Descent: Policy Convergence, Implicit Regularization, and Improved Sample Complexity

Authors: Yan Li, Guanghui Lan, Tuo Zhao

Abstract: We propose a new policy gradient method, named homotopic policy mirror descent (HPMD), for solving discounted, infinite horizon MDPs with finite state and action spaces. HPMD performs a mirror descent type policy update with an additional diminishing regularization term, and possesses several computational properties that seem to be new in the literature. We first establish the global linear convergence of HPMD instantiated with Kullback-Leibler divergence, for both the optimality gap, and a weighted distance to the set of optimal policies. Then local superlinear convergence is obtained for both quantities without any assumption. With local acceleration and diminishing regularization, we establish the first result among policy gradient methods on certifying and characterizing the limiting policy, by showing, with a non-asymptotic characterization, that the last-iterate policy converges to the unique optimal policy with the maximal entropy. We then extend all the aforementioned results to HPMD instantiated with a broad class of decomposable Bregman divergences, demonstrating the generality of these computational properties. As a byproduct, we discover the finite-time exact convergence for some commonly used Bregman divergences, implying the continuing convergence of HPMD to the limiting policy even if the current policy is already optimal. Finally, we develop a stochastic version of HPMD and establish similar convergence properties. By exploiting the local acceleration, we show that for small optimality gap, a better than $\tilde{\mathcal{O}}(\left|\mathcal{S}\right| \left|\mathcal{A}\right| / ε^2)$ sample complexity holds with high probability, when assuming a generative model for policy evaluation.

Submitted 29 November, 2022; v1 submitted 23 January, 2022; originally announced January 2022.

arXiv:2201.05756 [pdf, other]

Block Policy Mirror Descent

Authors: Guanghui Lan, Yan Li, Tuo Zhao

Abstract: In this paper, we present new policy gradient (PG) methods, namely the block policy mirror descent (BPMD) methods, for solving a class of regularized reinforcement learning (RL) problems with (strongly)-convex regularizers. Compared to the traditional PG methods with a batch update rule, which visits and updates the policy for every state, the BPMD method has cheap per-iteration computation via a partial update rule that performs the policy update on a sampled state. Despite the nonconvex nature of the problem and a partial update rule, we provide a unified analysis for several sampling schemes, and show that BPMD achieves fast linear convergence to the global optimality. In particular, uniform sampling leads to comparable worst-case total computational complexity as batch PG methods. A necessary and sufficient condition for convergence with on-policy sampling is also identified. With a hybrid sampling scheme, we further show that BPMD enjoys potential instance-dependent acceleration, leading to improved dependence on the state space and consequently outperforming batch PG methods. We then extend BPMD methods to the stochastic setting, by utilizing stochastic first-order information constructed from samples. With a generative model, $\tilde{\mathcal{O}}(\left\lvert \mathcal{S}\right\rvert \left\lvert \mathcal{A}\right\rvert /ε)$ (resp. $\tilde{\mathcal{O}}(\left\lvert \mathcal{S}\right\rvert \left\lvert \mathcal{A} \right\rvert /ε^2)$) sample complexities are established for the strongly-convex (resp. non-strongly-convex) regularizers, where $ε$ denotes the target accuracy. To the best of our knowledge, this is the first time that block coordinate descent methods have been developed and analyzed for policy optimization in reinforcement learning, which provides a new perspective on solving large-scale RL problems.

Submitted 17 September, 2022; v1 submitted 14 January, 2022; originally announced January 2022.

MSC Class: 90C40; 90C15; 90C26; 68Q25

arXiv:2112.13109 [pdf, other]

Accelerated and instance-optimal policy evaluation with linear function approximation

Authors: Tianjiao Li, Guanghui Lan, Ashwin Pananjady

Abstract: We study the problem of policy evaluation with linear function approximation and present efficient and practical algorithms that come with strong optimality guarantees. We begin by proving lower bounds that establish baselines on both the deterministic error and stochastic error in this problem. In particular, we prove an oracle complexity lower bound on the deterministic error in an instance-dependent norm associated with the stationary distribution of the transition kernel, and use the local asymptotic minimax machinery to prove an instance-dependent lower bound on the stochastic error in the i.i.d. observation model. Existing algorithms fail to match at least one of these lower bounds: To illustrate, we analyze a variance-reduced variant of temporal difference learning, showing in particular that it fails to achieve the oracle complexity lower bound. To remedy this issue, we develop an accelerated, variance-reduced fast temporal difference algorithm (VRFTD) that simultaneously matches both lower bounds and attains a strong notion of instance-optimality. Finally, we extend the VRFTD algorithm to the setting with Markovian observations, and provide instance-dependent convergence results. Our theoretical guarantees of optimality are corroborated by numerical experiments.

Submitted 13 August, 2022; v1 submitted 24 December, 2021; originally announced December 2021.

arXiv:2111.00996 [pdf, ps, other]

Mirror-prox sliding methods for solving a class of monotone variational inequalities

Authors: Guanghui Lan, Yuyuan Ouyang

Abstract: In this paper we propose new algorithms for solving a class of structured monotone variational inequality (VI) problems over compact feasible sets. By identifying the gradient components existing in the operator of VI, we show that it is possible to skip computations of the gradients from time to time, while still maintaining the optimal iteration complexity for solving these VI problems. Specifically, for deterministic VI problems involving the sum of the gradient of a smooth convex function $\nabla G$ and a monotone operator $H$, we propose a new algorithm, called the mirror-prox sliding method, which is able to compute an $\varepsilon$-approximate weak solution with at most $O((L/\varepsilon)^{1/2})$ evaluations of $\nabla G$ and $O((L/\varepsilon)^{1/2}+M/\varepsilon)$ evaluations of $H$, where $L$ and $M$ are Lipschitz constants of $\nabla G$ and $H$, respectively. Moreover, for the case when the operator $H$ can only be accessed through its stochastic estimators, we propose a stochastic mirror-prox sliding method that can compute a stochastic $\varepsilon$-approximate weak solution with at most $O((L/\varepsilon)^{1/2})$ evaluations of $\nabla G$ and $O((L/\varepsilon)^{1/2}+M/\varepsilon + σ^2/\varepsilon^2)$ samples of $H$, where $σ^2$ is the variance of the stochastic samples of $H$.

Submitted 1 November, 2021; originally announced November 2021.

arXiv:2110.10351 [pdf, other]

Faster Algorithm and Sharper Analysis for Constrained Markov Decision Process

Authors: Tianjiao Li, Ziwei Guan, Shaofeng Zou, Tengyu Xu, Yingbin Liang, Guanghui Lan

Abstract: The problem of constrained Markov decision process (CMDP) is investigated, where an agent aims to maximize the expected accumulated discounted reward subject to multiple constraints on its utilities/costs. A new primal-dual approach is proposed with a novel integration of three ingredients: entropy regularized policy optimizer, dual variable regularizer, and Nesterov's accelerated gradient descent dual optimizer, all of which are critical to achieve a faster convergence. The finite-time error bound of the proposed approach is characterized. Despite the challenge of the nonconcave objective subject to nonconcave constraints, the proposed approach is shown to converge to the global optimum with a complexity of $\tilde{\mathcal O}(1/ε)$ in terms of the optimality gap and the constraint violation, which improves the complexity of the existing primal-dual approach by a factor of $\mathcal O(1/ε)$ \citep{ding2020natural,paternain2019constrained}. This is the first demonstration that nonconcave CMDP problems can attain the complexity lower bound of $\mathcal O(1/ε)$ for convex optimization subject to convex constraints. Our primal-dual approach and non-asymptotic analysis are agnostic to the RL optimizer used, and thus are more flexible for practical applications. More generally, our approach also serves as the first algorithm that provably accelerates constrained nonconvex optimization with zero duality gap by exploiting the geometries such as the gradient dominance condition, for which the existing acceleration methods for constrained convex optimization are not applicable.

Submitted 19 October, 2021; originally announced October 2021.

Comments: The paper was initially submitted for publication in January 2021

arXiv:2110.04844 [pdf, other]

Frequency-aware SGD for Efficient Embedding Learning with Provable Benefits

Authors: Yan Li, Dhruv Choudhary, Xiaohan Wei, Baichuan Yuan, Bhargav Bhushanam, Tuo Zhao, Guanghui Lan

Abstract: Embedding learning has found widespread applications in recommendation systems and natural language modeling, among other domains. To learn quality embeddings efficiently, adaptive learning rate algorithms have demonstrated superior empirical performance over SGD, largely accredited to their token-dependent learning rate. However, the underlying mechanism for the efficiency of token-dependent learning rate remains underexplored. We show that incorporating frequency information of tokens in the embedding learning problems leads to provably efficient algorithms, and demonstrate that common adaptive algorithms implicitly exploit the frequency information to a large extent. Specifically, we propose (Counter-based) Frequency-aware Stochastic Gradient Descent, which applies a frequency-dependent learning rate for each token, and exhibits provable speed-up compared to SGD when the token distribution is imbalanced. Empirically, we show the proposed algorithms are able to improve or match adaptive algorithms on benchmark recommendation tasks and a large-scale industrial recommendation system, closing the performance gap between SGD and adaptive algorithms. Our results are the first to show token-dependent learning rate provably improves convergence for non-convex embedding learning problems.

Submitted 23 November, 2021; v1 submitted 10 October, 2021; originally announced October 2021.

Comments: Additional experiments on Word2Vec embedding learning included

arXiv:2102.00135 [pdf, ps, other]

Policy Mirror Descent for Reinforcement Learning: Linear Convergence, New Sampling Complexity, and Generalized Problem Classes

Authors: Guanghui Lan

Abstract: We present new policy mirror descent (PMD) methods for solving reinforcement learning (RL) problems with either strongly convex or general convex regularizers. By exploring the structural properties of these overall highly nonconvex problems we show that the PMD methods exhibit fast linear rate of convergence to the global optimality. We develop stochastic counterparts of these methods, and establish an ${\cal O}(1/ε)$ (resp., ${\cal O}(1/ε^2)$) sampling complexity for solving these RL problems with strongly (resp., general) convex regularizers using different sampling schemes, where $ε$ denotes the target accuracy. We further show that the complexity for computing the gradients of these regularizers, if necessary, can be bounded by ${\cal O}\{(\log_γε) [(1-γ)L/μ]^{1/2}\log (1/ε)\}$ (resp., ${\cal O} \{(\log_γε) (L/ε)^{1/2}\}$) for problems with strongly (resp., general) convex regularizers. Here $γ$ denotes the discounting factor. To the best of our knowledge, these complexity bounds, along with our algorithmic developments, appear to be new in both optimization and RL literature. The introduction of these convex regularizers also greatly expands the flexibility and applicability of RL models.

Submitted 6 April, 2022; v1 submitted 29 January, 2021; originally announced February 2021.

arXiv:2101.00143 [pdf, other]

Graph topology invariant gradient and sampling complexity for decentralized and stochastic optimization

Authors: Guanghui Lan, Yuyuan Ouyang, Yi Zhou

Abstract: One fundamental problem in decentralized multi-agent optimization is the trade-off between gradient/sampling complexity and communication complexity. We propose new algorithms whose gradient and sampling complexities are graph topology invariant while their communication complexities remain optimal. For convex smooth deterministic problems, we propose a primal dual sliding (PDS) algorithm that computes an $ε$-solution with $O((\tilde{L}/ε)^{1/2})$ gradient and $O((\tilde{L}/ε)^{1/2}+\|\mathcal{A}\|/ε)$ communication complexities, where $\tilde{L}$ is the smoothness parameter of the objective and $\mathcal{A}$ is related to either the graph Laplacian or the transpose of the oriented incidence matrix of the communication network. The results can be improved to $O((\tilde{L}/μ)^{1/2}\log(1/ε))$ and $O((\tilde{L}/μ)^{1/2}\log(1/ε) + \|\mathcal{A}\|/ε^{1/2})$ respectively with $μ$-strong convexity. We also propose a stochastic variant, the stochastic primal dual sliding (SPDS) algorithm, for problems with stochastic gradients. The SPDS algorithm utilizes the mini-batch technique and enables the agents to perform sampling and communication simultaneously. It computes a stochastic $ε$-solution with $O((\tilde{L}/ε)^{1/2} + (σ/ε)^2)$ sampling complexity, which can be improved to $O((\tilde{L}/μ)^{1/2}\log(1/ε) + σ^2/ε)$ with strong convexity. Here $σ^2$ is the variance. The communication complexities of SPDS remain the same as that of the deterministic case. All the aforementioned gradient and sampling complexities match the lower complexity bounds for centralized convex smooth optimization and are independent of the network structure. To the best of our knowledge, these gradient and sampling complexities have not been obtained before for decentralized optimization over a constraint feasible set.

Submitted 12 January, 2021; v1 submitted 31 December, 2020; originally announced January 2021.

Comments: 25 pages, 1 figure

arXiv:2011.10076 [pdf, other]

Optimal Algorithms for Convex Nested Stochastic Composite Optimization

Authors: Zhe Zhang, Guanghui Lan

Abstract: Recently, convex nested stochastic composite optimization (NSCO) has received considerable attention for its applications in reinforcement learning and risk-averse optimization. The current NSCO algorithms have worse stochastic oracle complexities, by orders of magnitude, than those for simpler stochastic composite optimization problems (e.g., sum of smooth and nonsmooth functions) without the nested structure. Moreover, they require all outer-layer functions to be smooth, which is not satisfied by some important applications. These discrepancies prompt us to ask: "does the nested composition make stochastic optimization more difficult in terms of the order of oracle complexity?" In this paper, we answer the question by developing order-optimal algorithms for the convex NSCO problem constructed from an arbitrary composition of smooth, structured non-smooth and general non-smooth layer functions. When all outer-layer functions are smooth, we propose a stochastic sequential dual (SSD) method to achieve an oracle complexity of $\mathcal{O}(1/ε^2)$ ($\mathcal{O}(1/ε)$) when the problem is non-strongly (strongly) convex. When there exists some structured non-smooth or general non-smooth outer-layer function, we propose a nonsmooth stochastic sequential dual (nSSD) method to achieve an oracle complexity of $\mathcal{O}(1/ε^2)$. We provide a lower complexity bound to show the latter $\mathcal{O}(1/ε^2)$ complexity to be unimprovable even under a strongly convex setting. All these complexity results seem to be new in the literature and they indicate that the convex NSCO problem has the same order of oracle complexity as those without the nested composition in all but the strongly convex and outer-non-smooth problem.

Submitted 21 June, 2022; v1 submitted 19 November, 2020; originally announced November 2020.

arXiv:2011.08434 [pdf, other]

Simple and optimal methods for stochastic variational inequalities, II: Markovian noise and policy evaluation in reinforcement learning

Authors: Georgios Kotsalis, Guanghui Lan, Tianjiao Li

Abstract: The focus of this paper is on stochastic variational inequalities (VI) under Markovian noise. A prominent application of our algorithmic developments is the stochastic policy evaluation problem in reinforcement learning. Prior investigations in the literature focused on temporal difference (TD) learning by employing nonsmooth finite time analysis motivated by stochastic subgradient descent leading to certain limitations. These encompass the requirement of analyzing a modified TD algorithm that involves projection to an a-priori defined Euclidean ball, achieving a non-optimal convergence rate and no clear way of deriving the beneficial effects of parallel implementation. Our approach remedies these shortcomings in the broader context of stochastic VIs and in particular when it comes to stochastic policy evaluation. We developed a variety of simple TD learning type algorithms motivated by its original version that maintain its simplicity, while offering distinct advantages from a non-asymptotic analysis point of view. We first provide an improved analysis of the standard TD algorithm that can benefit from parallel implementation. Then we present versions of a conditional TD algorithm (CTD), that involves periodic updates of the stochastic iterates, which reduce the bias and therefore exhibit improved iteration complexity. This brings us to the fast TD (FTD) algorithm which combines elements of CTD and the stochastic operator extrapolation method of the companion paper. For a novel index resetting policy FTD exhibits the best known convergence rate. We also devised a robust version of the algorithm that is particularly suitable for discounting factors close to 1.

Submitted 13 August, 2021; v1 submitted 14 November, 2020; originally announced November 2020.

Comments: arXiv admin note: text overlap with arXiv:2011.02987

MSC Class: 90C25; 90C15; 62L20; 68Q25

arXiv:2011.02987 [pdf, other]

Simple and optimal methods for stochastic variational inequalities, I: operator extrapolation

Authors: Georgios Kotsalis, Guanghui Lan, Tianjiao Li

Abstract: In this paper we first present a novel operator extrapolation (OE) method for solving deterministic variational inequality (VI) problems. Similar to the gradient (operator) projection method, OE updates one single search sequence by solving a single projection subproblem in each iteration. We show that OE can achieve the optimal rate of convergence for solving a variety of VI problems in a much simpler way than existing approaches. We then introduce the stochastic operator extrapolation (SOE) method and establish its optimal convergence behavior for solving different stochastic VI problems. In particular, SOE achieves the optimal complexity for solving a fundamental problem, i.e., stochastic smooth and strongly monotone VI, for the first time in the literature. We also present a stochastic block operator extrapolations (SBOE) method to further reduce the iteration cost for the OE method applied to large-scale deterministic VIs with a certain block structure. Numerical experiments have been conducted to demonstrate the potential advantages of the proposed algorithms. In fact, all these algorithms are applied to solve generalized monotone variational inequality (GMVI) problems whose operator is not necessarily monotone. We will also discuss optimal OE-based policy evaluation methods for reinforcement learning in a companion paper.

Submitted 19 June, 2023; v1 submitted 5 November, 2020; originally announced November 2020.

MSC Class: 90C25; 90C15; 62L20; 68Q25

arXiv:2010.12169 [pdf, other]

A Feasible Level Proximal Point Method for Nonconvex Sparse Constrained Optimization

Authors: Digvijay Boob, Qi Deng, Guanghui Lan, Yilin Wang

Abstract: Nonconvex sparse models have received significant attention in high-dimensional machine learning. In this paper, we study a new model consisting of a general convex or nonconvex objective and a variety of continuous nonconvex sparsity-inducing constraints. For this constrained model, we propose a novel proximal point algorithm that solves a sequence of convex subproblems with gradually relaxed constraint levels. Each subproblem, having a proximal point objective and a convex surrogate constraint, can be efficiently solved based on a fast routine for projection onto the surrogate constraint. We establish the asymptotic convergence of the proposed algorithm to the Karush-Kuhn-Tucker (KKT) solutions. We also establish new convergence complexities to achieve an approximate KKT solution when the objective can be smooth/nonsmooth, deterministic/stochastic and convex/nonconvex with complexity that is on a par with gradient descent for unconstrained optimization problems in respective cases. To the best of our knowledge, this is the first study of the first-order methods with complexity guarantee for nonconvex sparse-constrained problems. We perform numerical experiments to demonstrate the effectiveness of our new model and efficiency of the proposed algorithm for large scale problems.

Submitted 23 October, 2020; originally announced October 2020.

Comments: Accepted at NeurIPS 2020

arXiv:2008.04827 [pdf, ps, other]

The 4-D Gaussian Random Vector Maximum Conjecture and the 3-D Simplex Mean Width Conjecture

Authors: Wei Sun, Ze-Chun Hu, Guolie Lan

Abstract: We prove the four-dimensional Gaussian random vector maximum conjecture. This conjecture asserts that among all centered Gaussian random vectors $X=(X_1,X_2,X_3,X_4)$ with $E[X_i^2]=1$, $1\le i\le 4$, the expectation $E[\max(X_1,X_2,X_3,X_4)]$ is maximal if and only if all off-diagonal elements of the covariance matrix equal $-\frac{1}{3}$. As a direct consequence, we resolve the three-dimensional simplex mean width conjecture. This latter conjecture is a long-standing open problem in convex geometry, which asserts that among all simplices inscribed into the three-dimensional unit Euclidean ball the regular simplex has the maximal mean width.

Submitted 15 August, 2020; v1 submitted 10 August, 2020; originally announced August 2020.

MSC Class: 60E15; 52A40

arXiv:2007.00153 [pdf, other]

Conditional Gradient Methods for Convex Optimization with General Affine and Nonlinear Constraints

Authors: Guanghui Lan, Edwin Romeijn, Zhiqiang Zhou

Abstract: Conditional gradient methods have attracted much attention in both machine learning and optimization communities recently. These simple methods can guarantee the generation of sparse solutions. In addition, without the computation of full gradients, they can handle huge-scale problems sometimes even with an exponentially increasing number of decision variables. This paper aims to significantly expand the application areas of these methods by presenting new conditional gradient methods for solving convex optimization problems with general affine and nonlinear constraints. More specifically, we first present a new constraint extrapolated conditional gradient (CoexCG) method that can achieve an ${\cal O}(1/ε^2)$ iteration complexity for both smooth and structured nonsmooth function constrained convex optimization. We further develop novel variants of CoexCG, namely constraint extrapolated and dual regularized conditional gradient (CoexDurCG) methods, that can achieve similar iteration complexity to CoexCG but allow adaptive selection for algorithmic parameters. We illustrate the effectiveness of these methods for solving an important class of radiation therapy treatment planning problems arising from the healthcare industry. To the best of our knowledge, all the algorithmic schemes and their complexity results are new in the area of projection-free methods.

Submitted 29 June, 2021; v1 submitted 30 June, 2020; originally announced July 2020.

arXiv:2007.00132 [pdf, other]

Convex optimization for finite horizon robust covariance control of linear stochastic systems

Authors: Georgios Kotsalis, Guanghui Lan, Arkadi Nemirovski

Abstract: This work addresses the finite-horizon robust covariance control problem for discrete-time, partially observable, linear system affected by random zero mean noise and deterministic but unknown disturbances restricted to lie in what is called ellitopic uncertainty set (e.g., finite intersection of centered at the origin ellipsoids/elliptic cylinders). Performance specifications are imposed on the random state-control trajectory via averaged convex quadratic inequalities, linear inequalities on the mean, as well as pre-specified upper bounds on the covariance matrix. For this problem we develop a computationally tractable procedure for designing affine control policies, in the sense that the parameters of the policy that guarantees the aforementioned performance specifications are obtained as solutions to an explicit convex program. Our theoretical findings are illustrated by a numerical example.

Submitted 30 June, 2020; originally announced July 2020.

Comments: 29 pages, 1 figure

MSC Class: 90C47; 90C22; 49K30; 49M29

arXiv:2006.02032 [pdf, other]

doi 10.1007/s10107-022-01919-z

A Unified Single-loop Alternating Gradient Projection Algorithm for Nonconvex-Concave and Convex-Nonconcave Minimax Problems

Authors: Zi Xu, Huiling Zhang, Yang Xu, Guanghui Lan

Abstract: Much recent research effort has been directed to the development of efficient algorithms for solving minimax problems with theoretical convergence guarantees due to the relevance of these problems to a few emergent applications. In this paper, we propose a unified single-loop alternating gradient projection (AGP) algorithm for solving smooth nonconvex-(strongly) concave and (strongly) convex-nonconcave minimax problems. AGP employs simple gradient projection steps for updating the primal and dual variables alternatively at each iteration. We show that it can find an $\varepsilon$-stationary point of the objective function in $\mathcal{O}\left( \varepsilon ^{-2} \right)$ (resp. $\mathcal{O}\left( \varepsilon ^{-4} \right)$) iterations under nonconvex-strongly concave (resp. nonconvex-concave) setting. Moreover, its gradient complexity to obtain an $\varepsilon$-stationary point of the objective function is bounded by $\mathcal{O}\left( \varepsilon ^{-2} \right)$ (resp., $\mathcal{O}\left( \varepsilon ^{-4} \right)$) under the strongly convex-nonconcave (resp., convex-nonconcave) setting. To the best of our knowledge, this is the first time that a simple and unified single-loop algorithm is developed for solving both nonconvex-(strongly) concave and (strongly) convex-nonconcave minimax problems. Moreover, the complexity results for solving the latter (strongly) convex-nonconcave minimax problems have never been obtained before in the literature. Numerical results show the efficiency of the proposed AGP algorithm.
Furthermore, we extend the AGP algorithm by presenting a block alternating proximal gradient (BAPG) algorithm for solving more general multi-block nonsmooth nonconvex-(strongly) concave and (strongly) convex-nonconcave minimax problems. We can similarly establish the gradient complexity of the proposed algorithm under these four different settings.

Submitted 14 January, 2023; v1 submitted 3 June, 2020; originally announced June 2020.

MSC Class: 90C47; 90C26; 90C30

Journal ref: Mathematical Programming, 2023

arXiv:1912.07702 [pdf, ps, other]

Complexity of Stochastic Dual Dynamic Programming

Authors: Guanghui Lan

Abstract: Stochastic dual dynamic programming is a cutting plane type algorithm for multi-stage stochastic optimization originated about 30 years ago. In spite of its popularity in practice, there does not exist any analysis on the convergence rates of this method. In this paper, we first establish the number of iterations, i.e., iteration complexity, required by a basic dynamic cutting plane method for solving relatively simple multi-stage optimization problems, by introducing novel mathematical tools including the saturation of search points. We then refine these basic tools and establish the iteration complexity for both deterministic and stochastic dual dynamic programming methods for solving more general multi-stage stochastic optimization problems under the standard stage-wise independence assumption. Our results indicate that the complexity of some deterministic variants of these methods mildly increases with the number of stages $T$, in fact linearly dependent on $T$ for discounted problems. Therefore, they are efficient for strategic decision making which involves a large number of stages, but with a relatively small number of decision variables in each stage. Without explicitly discretizing the state and action spaces, these methods might also be pertinent to the related reinforcement learning and stochastic control areas.

Submitted 9 May, 2023; v1 submitted 16 December, 2019; originally announced December 2019.

arXiv:1910.11312 [pdf, ps, other]

Some explorations on two conjectures about Rademacher sequences

Authors: Ze-Chun Hu, Guolie Lan, Wei Sun

Abstract: In this paper, we explore two conjectures about Rademacher sequences. Let $(ε_i)$ be a Rademacher sequence, i.e., a sequence of independent $\{-1,1\}$-valued symmetric random variables. Set $S_n=a_1ε_1+\cdots+a_nε_n$ for $a=(a_1,\dots,a_n)\in \mathbb{R}^n$. The first conjecture says that $P\ (\ |S_n\ |\leq \|a\|\ )\geq\frac{1}{2}$ for all $a\in \mathbb{R}^n$ and $n\in \mathbb{N}$. The second conjecture says that $P\ (\ |S_n\ |\geq\|a\|\ )\geq \frac{7}{32}$ for all $a\in \mathbb{R}^n$ and $n\in \mathbb{N}$. Regarding the first conjecture, we present several new equivalent formulations. These include a topological view, a combinatorial version and a strengthened version of the conjecture. Regarding the second conjecture, we prove that it holds true when $n\leq 7$.

Submitted 24 October, 2019; originally announced October 2019.

Comments: 19 pages

MSC Class: 60C05; 60G50

arXiv:1909.11216 [pdf, ps, other]

Efficient Algorithms for Distributionally Robust Stochastic Optimization with Discrete Scenario Support

Authors: Zhe Zhang, Shabbir Ahmed, Guanghui Lan

Abstract: Recently, there has been a growing interest in distributionally robust optimization (DRO) as a principled approach to data-driven decision making. In this paper, we consider a distributionally robust two-stage stochastic optimization problem with discrete scenario support. While much research effort has been devoted to tractable reformulations for DRO problems, especially those with continuous scenario support, few efficient numerical algorithms were developed, and most of them can neither handle the non-smooth second-stage cost function nor the large number of scenarios $K$ effectively. We fill the gap by reformulating the DRO problem as a trilinear min-max-max saddle point problem and developing novel algorithms that can achieve an $\mathcal{O}(1/ε)$ iteration complexity which only mildly depends on $K$. The major computations involved in each iteration of these algorithms can be conducted in parallel if necessary. Besides, for solving an important class of DRO problems with the Kantorovich ball ambiguity set, we propose a slight modification of our algorithms to avoid the expensive computation of the probability vector projection at the price of $\mathcal{O}(\sqrt{K})$ times more iterations. Finally, preliminary numerical experiments are conducted to demonstrate the empirical advantages of the proposed algorithms.

Submitted 3 December, 2020; v1 submitted 24 September, 2019; originally announced September 2019.

arXiv:1908.02734 [pdf, ps, other]

Stochastic First-order Methods for Convex and Nonconvex Functional Constrained Optimization

Authors: Digvijay Boob, Qi Deng, Guanghui Lan

Abstract: Functional constrained optimization is becoming more and more important in machine learning and operations research. Such problems have potential applications in risk-averse machine learning, semisupervised learning, and robust optimization among others. In this paper, we first present a novel Constraint Extrapolation (ConEx) method for solving convex functional constrained problems, which utilizes linear approximations of the constraint functions to define the extrapolation (or acceleration) step. We show that this method is a unified algorithm that achieves the best-known rate of convergence for solving different functional constrained convex composite problems, including convex or strongly convex, and smooth or nonsmooth problems with a stochastic objective and/or stochastic constraints. Many of these rates of convergence were in fact obtained for the first time in the literature. In addition, ConEx is a single-loop algorithm that does not involve any penalty subproblems. Contrary to existing primal-dual methods, it does not require the projection of Lagrangian multipliers into a (possibly unknown) bounded set. Second, for nonconvex functional constrained problems, we introduce a new proximal point method that transforms the initial nonconvex problem into a sequence of convex problems by adding quadratic terms to both the objective and constraints. Under a certain MFCQ-type assumption, we establish the convergence and rate of convergence of this method to KKT points when the convex subproblems are solved exactly or inexactly.
For large-scale and stochastic problems, we present a more practical proximal point method in which the approximate solutions of the subproblems are computed by the aforementioned ConEx method. To the best of our knowledge, most of these convergence and complexity results of the proximal point method for nonconvex problems also seem to be new in the literature.

Submitted 26 January, 2022; v1 submitted 7 August, 2019; originally announced August 2019.

Comments: 36 pages, final version, accepted at Math Programming

arXiv:1905.12412 [pdf, other]

A unified variance-reduced accelerated gradient method for convex optimization

Authors: Guanghui Lan, Zhize Li, Yi Zhou

Abstract: We propose a novel randomized incremental gradient algorithm, namely, VAriance-Reduced Accelerated Gradient (Varag), for finite-sum optimization. Equipped with a unified step-size policy that adjusts itself to the value of the condition number, Varag exhibits the unified optimal rates of convergence for solving smooth convex finite-sum problems directly regardless of their strong convexity. Moreover, Varag is the first accelerated randomized incremental gradient method that benefits from the strong convexity of the data-fidelity term to achieve the optimal linear convergence. It also establishes an optimal linear rate of convergence for solving a wide class of problems only satisfying a certain error bound condition rather than strong convexity. Varag can also be extended to solve stochastic finite-sum problems.

Submitted 30 October, 2019; v1 submitted 29 May, 2019; originally announced May 2019.

Comments: 33rd Conference on Neural Information Processing Systems (NeurIPS 2019)

arXiv:1905.04279 [pdf, ps, other]

The Three-Dimensional Gaussian Product Inequality

Authors: Guolie Lan, Ze-Chun Hu, Wei Sun

Abstract: We prove the 3-dimensional Gaussian product inequality, i.e., for any real-valued centered Gaussian random vector $(X,Y,Z)$ and $m\in \mathbb{N}$, it holds that ${\mathbf{E}}[X^{2m}Y^{2m}Z^{2m}]\geq{\mathbf{E}}[X^{2m}]{\mathbf{E}}[Y^{2m}]{\mathbf{E}}[Z^{2m}]$. Our proof is based on some improved inequalities on multi-term products involving 2-dimensional Gaussian random vectors. The improved inequalities are derived using the Gaussian hypergeometric functions and have independent interest. As by-products, several new combinatorial identities and inequalities are obtained.

Submitted 10 May, 2019; originally announced May 2019.

MSC Class: 60E15; 62H12

arXiv:1903.03917 [pdf, ps, other]

Products of Conditional Expectation Operators: Convergence and Divergence

Authors: Guolie Lan, Ze-Chun Hu, Wei Sun

Abstract: In this paper, we investigate the convergence of products of conditional expectation operators. We show that if $(Ω,\cal{F},P)$ is a probability space that is not purely atomic, then divergent sequences of products of conditional expectation operators involving 3 or 4 sub-$σ$-fields of $\cal{F}$ can be constructed for a large class of random variables in $L^2(Ω,\cal{F},P)$. This settles in the negative a long-open conjecture. On the other hand, we show that if $(Ω,\cal{F},P)$ is a purely atomic probability space, then products of conditional expectation operators involving any finite set of sub-$σ$-fields of $\cal{F}$ must converge for all random variables in $L^1(Ω,\cal{F},P)$.

Submitted 5 July, 2019; v1 submitted 9 March, 2019; originally announced March 2019.

MSC Class: 60A05; 60F15; 60F25

arXiv:1810.03763 [pdf, other]

Cubic Regularization with Momentum for Nonconvex Optimization

Authors: Zhe Wang, Yi Zhou, Yingbin Liang, Guanghui Lan

Abstract: Momentum is a popular technique to accelerate the convergence in practical training, and its impact on convergence guarantee has been well-studied for first-order algorithms. However, such a successful acceleration technique has not yet been proposed for second-order algorithms in nonconvex optimization. In this paper, we apply the momentum scheme to cubic regularized (CR) Newton's method and explore the potential for acceleration. Our numerical experiments on various nonconvex optimization problems demonstrate that the momentum scheme can substantially facilitate the convergence of cubic regularization, and perform even better than Nesterov's acceleration scheme for CR. Theoretically, we prove that CR under momentum achieves the best possible convergence rate to a second-order stationary point for nonconvex optimization. Moreover, we study the proposed algorithm for solving problems satisfying an error bound condition and establish a local quadratic convergence rate. Then, particularly for finite-sum problems, we show that the proposed algorithm can allow computational inexactness that reduces the overall sample complexity without degrading the convergence rate.

Submitted 27 June, 2019; v1 submitted 8 October, 2018; originally announced October 2018.

arXiv:1809.09258 [pdf, other]

Asynchronous decentralized accelerated stochastic gradient descent

Authors: Guanghui Lan, Yi Zhou

Abstract: In this work, we introduce an asynchronous decentralized accelerated stochastic gradient descent type of method for decentralized stochastic optimization, considering communication and synchronization are the major bottlenecks. We establish $\mathcal{O}(1/ε)$ (resp., $\mathcal{O}(1/\sqrtε)$) communication complexity and $\mathcal{O}(1/ε^2)$ (resp., $\mathcal{O}(1/ε)$) sampling complexity for solving general convex (resp., strongly convex) problems.

Submitted 24 September, 2018; originally announced September 2018.

arXiv:1808.07384 [pdf, ps, other]

A Note on Inexact Condition for Cubic Regularized Newton's Method

Authors: Zhe Wang, Yi Zhou, Yingbin Liang, Guanghui Lan

Abstract: This note considers the inexact cubic-regularized Newton's method (CR), which has been shown in \cite{Cartis2011a} to achieve the same order-level convergence rate to a secondary stationary point as the exact CR \citep{Nesterov2006}. However, the inexactness condition in \cite{Cartis2011a} is not implementable due to its dependence on future iterates variable. This note fixes such an issue by proving the same convergence rate for nonconvex optimization under an inexact adaptive condition that depends on only the current iterate. Our proof controls the sufficient decrease of the function value over the total iterations rather than each iteration as used in the previous studies, which can be of independent interest in other contexts.

Submitted 22 August, 2018; originally announced August 2018.

arXiv:1807.08983 [pdf, ps, other]

Strong convergence rates of modified truncated EM methods for neutral stochastic differential delay equations

Authors: Guangqiang Lan, Qiushi Wang

Abstract: The aim of this paper is to investigate strong convergence of modified truncated Euler-Maruyama method for neutral stochastic differential delay equations introduced in Lan (2018). Strong convergence rates of the given numerical scheme to the exact solutions at fixed time $T$ are obtained under local Lipschitz and Khasminskii-type conditions. Moreover, convergence rates over a time interval $[0,T]$ are also obtained under additional polynomial growth condition on $g$ without the weak monotonicity condition (which is usually the standard assumption to obtain the convergence rate). Two examples are presented to interpret our conclusions.

Submitted 24 July, 2018; originally announced July 2018.

Comments: 21 pages

MSC Class: 60H10; 65C30; 65L20

arXiv:1805.05411 [pdf, ps, other]

Accelerated Stochastic Algorithms for Nonconvex Finite-sum and Multi-block Optimization

Authors: Guanghui Lan, Yu Yang

Abstract: In this paper, we present new stochastic methods for solving two important classes of nonconvex optimization problems. We first introduce a randomized accelerated proximal gradient (RapGrad) method for solving a class of nonconvex optimization problems consisting of the sum of $m$ component functions, and show that it can significantly reduce the number of gradient computations especially when the condition number $L/μ$ (i.e., the ratio between the Lipschitz constant and negative curvature) is large. More specifically, RapGrad can save up to ${\cal O}(\sqrt{m})$ gradient computations than existing deterministic nonconvex accelerated gradient methods. Moreover, the number of gradient computations required by RapGrad can be ${\cal O}(m^\frac{1}{6} L^\frac{1}{2} / μ^\frac{1}{2})$ (at least ${\cal O}(m^\frac{2}{3})$) times smaller than the best-known randomized nonconvex gradient methods when $L/μ\ge m$. Inspired by RapGrad, we also develop a new randomized accelerated proximal dual (RapDual) method for solving a class of multi-block nonconvex optimization problems coupled with linear constraints. We demonstrate that RapDual can also save up to a factor of ${\cal O}(\sqrt{m})$ projection subproblems than its deterministic counterpart, where $m$ denotes the number of blocks. To the best of our knowledge, all these complexity results associated with RapGrad and RapDual seem to be new in the literature. We also illustrate potential advantages of these algorithms through our preliminary numerical experiments.

Submitted 18 August, 2019; v1 submitted 14 May, 2018; originally announced May 2018.

arXiv:1802.07372 [pdf, ps, other]

Stochastic Variance-Reduced Cubic Regularization for Nonconvex Optimization

Authors: Zhe Wang, Yi Zhou, Yingbin Liang, Guanghui Lan

Abstract: Cubic regularization (CR) is an optimization method with emerging popularity due to its capability to escape saddle points and converge to second-order stationary solutions for nonconvex optimization. However, CR encounters a high sample complexity issue for finite-sum problems with a large data size. Various inexact variants of CR have been proposed to improve the sample complexity. In this paper, we propose a stochastic variance-reduced cubic-regularization (SVRC) method under random sampling, and study its convergence guarantee as well as sample complexity. We show that the iteration complexity of SVRC for achieving a second-order stationary solution within $ε$ accuracy is $O(ε^{-3/2})$, which matches the state-of-the-art result on CR types of methods. Moreover, our proposed variance reduction scheme significantly reduces the per-iteration sample complexity. The resulting total Hessian sample complexity of our SVRC is $\mathcal{O}(N^{2/3} ε^{-3/2})$, which outperforms the state-of-the-art result by a factor of $O(N^{2/15})$. We also study our SVRC under random sampling without replacement scheme, which yields a lower per-iteration sample complexity, and hence justifies its practical applicability.

Submitted 8 October, 2018; v1 submitted 20 February, 2018; originally announced February 2018.

arXiv:1801.04517 [pdf, ps, other]

Polynomial stability of exact solution and a numerical method for stochastic differential equations with time-dependent delay

Authors: Guangqiang Lan, Fang Xia, Qiushi Wang

Abstract: Polynomial stability of exact solution and modified truncated Euler-Maruyama method for stochastic differential equations with time-dependent delay are investigated in this paper. By using the well known discrete semimartingale convergence theorem, sufficient conditions are obtained for both bounded and unbounded delay $δ$ to ensure the polynomial stability of the corresponding numerical approximation. Examples are presented to illustrate the conclusion.

Submitted 14 January, 2018; originally announced January 2018.

MSC Class: 60H10; 65C30

Showing 1–50 of 88 results for author: Lan, G