A Generalized Version of Chung's Lemma and its Applications

Authors: Li Jiang, Xiao Li, Andre Milzarek, Junwen Qiu

Abstract: Chung's lemma is a classical tool for establishing asymptotic convergence rates of (stochastic) optimization methods under strong convexity-type assumptions and appropriate polynomial diminishing step sizes. In this work, we develop a generalized version of Chung's lemma, which provides a simple non-asymptotic convergence framework for a more general family of step size rules. We demonstrate broad applicability of the proposed generalized Chung's lemma by deriving tight non-asymptotic convergence rates for a large variety of stochastic methods. In particular, we obtain partially new non-asymptotic complexity results for stochastic optimization methods, such as stochastic gradient descent and random reshuffling, under a general $(\theta,\mu)$-Polyak-Lojasiewicz (PL) condition and for various step size strategies, including polynomial, constant, exponential, and cosine step size rules. Notably, as a by-product of our analysis, we observe that exponential step sizes can adapt to the objective function's geometry, achieving the optimal convergence rate without requiring exact knowledge of the underlying landscape. Our results demonstrate that the developed variant of Chung's lemma offers a versatile, systematic, and streamlined approach to establish non-asymptotic convergence rates under general step size rules.

Submitted 9 June, 2024; originally announced June 2024.

Comments: 43 pages, 5 figures

MSC Class: 90C15; 90C30; 90C26

arXiv:2406.02273 [pdf, ps, other]

A KL-based Analysis Framework with Applications to Non-Descent Optimization Methods

Authors: Junwen Qiu, Bohao Ma, Xiao Li, Andre Milzarek

Abstract: We propose a novel analysis framework for non-descent-type optimization methodologies in nonconvex scenarios based on the Kurdyka-Lojasiewicz property. Our framework allows covering a broad class of algorithms, including those commonly employed in stochastic and distributed optimization. Specifically, it enables the analysis of first-order methods that lack a sufficient descent property and do not require access to full (deterministic) gradient information. We leverage this framework to establish, for the first time, iterate convergence and the corresponding rates for the decentralized gradient method and federated averaging under mild assumptions. Furthermore, based on the new analysis techniques, we show the convergence of the random reshuffling and stochastic gradient descent method without necessitating typical a priori bounded iterates assumptions.

Submitted 4 June, 2024; originally announced June 2024.

Comments: 29 pages

MSC Class: 90C06; 90C26; 90C30

arXiv:2405.16954 [pdf, ps, other]

Convergence of SGD with momentum in the nonconvex case: A time window-based analysis

Authors: Junwen Qiu, Bohao Ma, Andre Milzarek

Abstract: We propose a novel time window-based analysis technique to investigate the convergence properties of the stochastic gradient descent method with momentum (SGDM) in nonconvex settings. Despite its popularity, the convergence behavior of SGDM remains less understood in nonconvex scenarios. This is primarily due to the absence of a sufficient descent property and challenges in simultaneously controlling the momentum and stochastic errors in an almost sure sense. To address these challenges, we investigate the behavior of SGDM over specific time windows, rather than examining the descent of consecutive iterates as in traditional studies. This time window-based approach simplifies the convergence analysis and enables us to establish the first iterate convergence result for SGDM under the Kurdyka-Lojasiewicz (KL) property. We further provide local convergence rates which depend on the underlying KL exponent and the utilized step size schemes.

Submitted 23 June, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

Comments: 25 pages

arXiv:2404.18452 [pdf, other]

Random Reshuffling with Momentum for Nonconvex Problems: Iteration Complexity and Last Iterate Convergence

Authors: Junwen Qiu, Andre Milzarek

Abstract: Random reshuffling with momentum (RRM) corresponds to the SGD optimizer with the momentum option enabled, as found in popular machine learning libraries like PyTorch and TensorFlow. Despite its widespread use in practical applications, the understanding of its convergence properties in nonconvex scenarios remains limited. Under a Lipschitz smoothness assumption, this paper provides one of the first iteration complexities for RRM. Specifically, we prove that RRM achieves the iteration complexity $O(n^{-1/3}((1-\beta^n)T)^{-2/3})$ where $n$ denotes the number of component functions $f(\cdot;i)$ and $\beta \in [0,1)$ is the momentum parameter. Furthermore, every accumulation point of a sequence of iterates $\{x^k\}_k$ generated by RRM is shown to be a stationary point of the problem. In addition, under the Kurdyka-Lojasiewicz inequality - a local geometric property - the iterates $\{x^k\}_k$ provably converge to a unique stationary point $x^*$ of the objective function. Importantly, in our analysis, this last iterate convergence is obtained without requiring convexity or a priori boundedness of the iterates. Finally, for polynomial step size schemes, convergence rates of the form $\|x^k - x^*\| = O(k^{-p})$, $\|\nabla f(x^k)\|^2 = O(k^{-q})$, and $|f(x^k) - f(x^*)| = O(k^{-q})$, $p \in (0,1]$, $q \in (0,2]$ are derived.

Submitted 29 April, 2024; originally announced April 2024.

Comments: 51 pages, 10 figures

MSC Class: 90C26; 90C15

arXiv:2312.01047 [pdf, other]

A New Random Reshuffling Method for Nonsmooth Nonconvex Finite-sum Optimization

Authors: Junwen Qiu, Xiao Li, Andre Milzarek

Abstract: Random reshuffling techniques are prevalent in large-scale applications, such as training neural networks. While the convergence and acceleration effects of random reshuffling-type methods are fairly well understood in the smooth setting, far fewer studies seem to be available in the nonsmooth case. In this work, we design a new normal map-based proximal random reshuffling (norm-PRR) method for nonsmooth nonconvex finite-sum problems. We show that norm-PRR achieves the iteration complexity $O(n^{-1/3}T^{-2/3})$ where $n$ denotes the number of component functions $f(\cdot,i)$ and $T$ counts the total number of iterations. This improves the currently known complexity bounds for this class of problems by a factor of $n^{-1/3}$. In addition, we prove that norm-PRR converges linearly under the (global) Polyak-Lojasiewicz condition and in the interpolation setting. We further complement these non-asymptotic results and provide an in-depth analysis of the asymptotic properties of norm-PRR. Specifically, under the (local) Kurdyka-Lojasiewicz inequality, the whole sequence of iterates generated by norm-PRR is shown to converge to a single stationary point. Moreover, we derive last iterate convergence rates that can match those in the smooth, strongly convex setting. Finally, numerical experiments are performed on nonconvex classification tasks to illustrate the efficiency of the proposed approach.

Submitted 30 April, 2024; v1 submitted 2 December, 2023; originally announced December 2023.

Comments: 43 pages, 4 figures

MSC Class: 90C26; 90C15

arXiv:2311.07276 [pdf, other]

Variational Properties of Decomposable Functions Part II: Strong Second-Order Theory

Authors: Wenqing Ouyang, Andre Milzarek

Abstract: Local superlinear convergence of the semismooth Newton method usually requires the uniform invertibility of the generalized Jacobian matrix, e.g. BD-regularity or CD-regularity. For several types of nonlinear programming and composite-type optimization problems -- for which the generalized Jacobian of the stationary equation can be calculated explicitly -- this is characterized by the strong second-order sufficient condition. However, general characterizations are still not well understood. In this paper, we propose a strong second-order sufficient condition (SSOSC) for composite problems whose nonsmooth part has a generalized conic-quadratic second subderivative. We then discuss the relationship between the SSOSC and another second order-type condition that involves the generalized Jacobians of the normal map. In particular, these two conditions are equivalent under certain structural assumptions on the generalized Jacobian matrix of the proximity operator. Next, we verify these structural assumptions for $C^2$-strictly decomposable functions via analyzing their second-order variational properties under additional geometric assumptions on the support set of the decomposition pair. Finally, we show that the SSOSC is further equivalent to the strong metric regularity condition of the subdifferential, the normal map, and the natural residual. Counterexamples illustrate the necessity of our assumptions.

Submitted 13 November, 2023; originally announced November 2023.

Comments: 28 pages; preliminary draft

arXiv:2311.07267 [pdf, ps, other]

Variational Properties of Decomposable Functions. Part I: Strict Epi-Calculus and Applications

Authors: Wenqing Ouyang, Andre Milzarek

Abstract: We provide systematic studies of the variational properties of decomposable functions which are compositions of an outer support function and an inner smooth mapping under certain constraint qualifications. We put a particular focus on the strict twice epi-differentiability and the associated strict second subderivative of such functions. Calculus rules for the (strict) second subderivative and twice epi-differentiability of decomposable functions are derived which allow us to link the (strict) second subderivative of decomposable mappings to the simpler outer support function. Applying the variational properties of the support function, we establish the equivalence between strict twice epi-differentiability of decomposable functions, continuous differentiability of its proximity operator, and the strict complementarity condition. This allows us to fully characterize the strict saddle point property of decomposable functions. In addition, we give a formula for the strict second subderivative for decomposable functions whose outer support set is sufficiently regular. This provides an alternative characterization of the strong metric regularity of the subdifferential of decomposable functions. Finally, we verify that these introduced regularity conditions are satisfied by many practical functions.

Submitted 13 November, 2023; originally announced November 2023.

Comments: 28 pages; preliminary draft

arXiv:2309.17096 [pdf, other]

Obtaining Pseudo-inverse Solutions With MINRES

Authors: Yang Liu, Andre Milzarek, Fred Roosta

Abstract: The celebrated minimum residual method (MINRES), proposed in the seminal paper of Paige and Saunders, has seen great success and widespread use in solving linear least-squares problems involving Hermitian matrices, with further extensions to complex symmetric settings. Unless the system is consistent whereby the right-hand side vector lies in the range of the matrix, MINRES is not guaranteed to obtain the pseudo-inverse solution. Variants of MINRES, such as MINRES-QLP, which can achieve such minimum norm solutions, are known to be both computationally expensive and challenging to implement. We propose a novel and remarkably simple lifting strategy that seamlessly integrates with the final MINRES iteration, enabling us to obtain the minimum norm solution with negligible additional computational costs. We study our lifting strategy in a diverse range of settings encompassing Hermitian and complex symmetric systems as well as those with semi-definite preconditioners. We also provide numerical experiments to support our analysis and showcase the effects of our lifting strategy.

Submitted 29 January, 2024; v1 submitted 29 September, 2023; originally announced September 2023.

arXiv:2305.05828 [pdf, other]

Convergence of a Normal Map-based Prox-SGD Method under the KL Inequality

Authors: Andre Milzarek, Junwen Qiu

Abstract: In this paper, we present a novel stochastic normal map-based algorithm ($\mathsf{norM}\text{-}\mathsf{SGD}$) for nonconvex composite-type optimization problems and discuss its convergence properties. Using a time window-based strategy, we first analyze the global convergence behavior of $\mathsf{norM}\text{-}\mathsf{SGD}$ and it is shown that every accumulation point of the generated sequence of iterates $\{\boldsymbol{x}^k\}_k$ corresponds to a stationary point almost surely and in an expectation sense. The obtained results hold under standard assumptions and extend the more limited convergence guarantees of the basic proximal stochastic gradient method. In addition, based on the well-known Kurdyka-Łojasiewicz (KL) analysis framework, we provide novel point-wise convergence results for the iterates $\{\boldsymbol{x}^k\}_k$ and derive convergence rates that depend on the underlying KL exponent $\boldsymbol{\theta}$ and the step size dynamics $\{\alpha_k\}_k$. Specifically, for the popular step size scheme $\alpha_k=\mathcal{O}(1/k^\gamma)$, $\gamma \in (\frac23,1]$, (almost sure) rates of the form $\|\boldsymbol{x}^k-\boldsymbol{x}^*\| = \mathcal{O}(1/k^p)$, $p \in (0,\frac12)$, can be established. The obtained rates are faster than related and existing convergence rates for $\mathsf{SGD}$ and improve on the non-asymptotic complexity bounds for $\mathsf{norM}\text{-}\mathsf{SGD}$.

Submitted 9 May, 2023; originally announced May 2023.

Comments: 34 pages, 14 figures

MSC Class: 90C26; 90C15

arXiv:2206.03907 [pdf, ps, other]

A Unified Convergence Theorem for Stochastic Optimization Methods

Authors: Xiao Li, Andre Milzarek

Abstract: In this work, we provide a fundamental unified convergence theorem used for deriving expected and almost sure convergence results for a series of stochastic optimization methods. Our unified theorem only requires verifying several representative conditions and is not tailored to any specific algorithm. As a direct application, we recover expected and almost sure convergence results of the stochastic gradient method (SGD) and random reshuffling (RR) under more general settings. Moreover, we establish new expected and almost sure convergence results for the stochastic proximal gradient method (prox-SGD) and stochastic model-based methods (SMM) for nonsmooth nonconvex optimization problems. These applications reveal that our unified theorem provides a plugin-type convergence analysis and strong convergence guarantees for a wide class of stochastic optimization methods.

Submitted 19 October, 2022; v1 submitted 8 June, 2022; originally announced June 2022.

Comments: Accepted in the 36th Conference on Neural Information Processing Systems (NeurIPS 2022)

arXiv:2206.01372 [pdf, other]

Descent Properties of an Anderson Accelerated Gradient Method With Restarting

Authors: Wenqing Ouyang, Yang Liu, Andre Milzarek

Abstract: Anderson Acceleration (AA) is a popular acceleration technique to enhance the convergence of fixed-point iterations. The analysis of AA approaches typically focuses on the convergence behavior of a corresponding fixed-point residual, while the behavior of the underlying objective function values along the accelerated iterates is currently not well understood. In this paper, we investigate local properties of AA with restarting applied to a basic gradient scheme in terms of function values. Specifically, we show that AA with restarting is a local descent method and that it can decrease the objective function faster than the gradient method. These new results theoretically support the good numerical performance of AA when heuristic descent conditions are used for globalization and they provide a novel perspective on the convergence analysis of AA that is more amenable to nonconvex optimization problems. Numerical experiments are conducted to illustrate our theoretical findings.

Submitted 25 September, 2023; v1 submitted 2 June, 2022; originally announced June 2022.

Comments: 30 pages; 4 figures

MSC Class: 90C26

arXiv:2204.00406 [pdf, other]

doi 10.1137/22M1488181

A Semismooth Newton Stochastic Proximal Point Algorithm with Variance Reduction

Authors: Andre Milzarek, Fabian Schaipp, Michael Ulbrich

Abstract: We develop an implementable stochastic proximal point (SPP) method for a class of weakly convex, composite optimization problems. The proposed stochastic proximal point algorithm incorporates a variance reduction mechanism and the resulting SPP updates are solved using an inexact semismooth Newton framework. We establish detailed convergence results that take the inexactness of the SPP steps into account and that are in accordance with existing convergence guarantees of (proximal) stochastic variance-reduced gradient methods. Numerical experiments show that the proposed algorithm competes favorably with other state-of-the-art methods and achieves higher robustness with respect to the step size selection.

Submitted 26 March, 2024; v1 submitted 1 April, 2022; originally announced April 2022.

MSC Class: 90C26; 90C06; 65K10

arXiv:2112.15287 [pdf, other]

Distributed Random Reshuffling over Networks

Authors: Kun Huang, Xiao Li, Andre Milzarek, Shi Pu, Junwen Qiu

Abstract: In this paper, we consider distributed optimization problems where $n$ agents, each possessing a local cost function, collaboratively minimize the average of the local cost functions over a connected network. To solve the problem, we propose a distributed random reshuffling (D-RR) algorithm that invokes the random reshuffling (RR) update in each agent. We show that D-RR inherits favorable characteristics of RR for both smooth strongly convex and smooth nonconvex objective functions. In particular, for smooth strongly convex objective functions, D-RR achieves an $\mathcal{O}(1/T^2)$ rate of convergence (where $T$ counts the number of epochs) in terms of the squared distance between the iterate and the global minimizer. When the objective function is assumed to be smooth nonconvex, we show that D-RR drives the squared norm of the gradient to $0$ at a rate of $\mathcal{O}(1/T^{2/3})$. These convergence results match those of centralized RR (up to constant factors) and outperform the distributed stochastic gradient descent (DSGD) algorithm if we run a relatively large number of epochs. Finally, we conduct a set of numerical experiments to illustrate the efficiency of the proposed D-RR method on both strongly convex and nonconvex distributed optimization problems.

Submitted 23 March, 2023; v1 submitted 30 December, 2021; originally announced December 2021.

Comments: 20 pages, 13 figures

arXiv:2110.04926 [pdf, ps, other]

Convergence of Random Reshuffling Under The Kurdyka-Łojasiewicz Inequality

Authors: Xiao Li, Andre Milzarek, Junwen Qiu

Abstract: We study the random reshuffling (RR) method for smooth nonconvex optimization problems with a finite-sum structure. Though this method is widely utilized in practice, such as in the training of neural networks, its convergence behavior is only understood in several limited settings. In this paper, under the well-known Kurdyka-Lojasiewicz (KL) inequality, we establish strong limit-point convergence results for RR with appropriate diminishing step sizes, namely, the whole sequence of iterates generated by RR is convergent and converges to a single stationary point in an almost sure sense. In addition, we derive the corresponding rate of convergence, depending on the KL exponent and the suitably selected diminishing step sizes. When the KL exponent lies in $[0,\frac12]$, the convergence is at a rate of $\mathcal{O}(t^{-1})$ with $t$ counting the iteration number. When the KL exponent belongs to $(\frac12,1)$, our derived convergence rate is of the form $\mathcal{O}(t^{-q})$ with $q\in (0,1)$ depending on the KL exponent. The standard KL inequality-based convergence analysis framework only applies to algorithms with a certain descent property. We conduct a novel convergence analysis for the non-descent RR method with diminishing step sizes based on the KL inequality, which generalizes the standard KL framework. We summarize our main steps and core ideas in an informal analysis framework, which is of independent interest. As a direct application of this framework, we also establish similar strong limit-point convergence results for the reshuffled proximal point method.

Submitted 25 January, 2023; v1 submitted 10 October, 2021; originally announced October 2021.

Comments: Accepted for publication in SIAM Journal on Optimization

arXiv:2106.09340 [pdf, other]

A trust region-type normal map-based semismooth Newton method for nonsmooth nonconvex composite optimization

Authors: Wenqing Ouyang, Andre Milzarek

Abstract: We propose a novel trust region method for solving a class of nonsmooth, nonconvex composite-type optimization problems. The approach embeds inexact semismooth Newton steps for finding zeros of a normal map-based stationarity measure for the problem in a trust region framework. Based on a new merit function and acceptance mechanism, global convergence and transition to fast local q-superlinear convergence are established under standard conditions. In addition, we verify that the proposed trust region globalization is compatible with the Kurdyka-Łojasiewicz inequality yielding finer convergence results. We further derive new normal map-based representations of the associated second-order optimality conditions that have direct connections to the local assumptions required for fast convergence. Finally, we study the behavior of our algorithm when the Hessian matrix of the smooth part of the objective function is approximated by BFGS updates. We successfully link the KL theory, properties of the BFGS approximations, and a Dennis-Moré-type condition to show superlinear convergence of the quasi-Newton version of our method. Numerical experiments on sparse logistic regression, image compression, and a constrained log-determinant problem illustrate the efficiency of the proposed algorithm.

Submitted 3 October, 2023; v1 submitted 17 June, 2021; originally announced June 2021.

Comments: 55 pages, 5 figures

arXiv:2006.02559 [pdf, other]

Nonmonotone Globalization for Anderson Acceleration via Adaptive Regularization

Authors: Wenqing Ouyang, Jiong Tao, Andre Milzarek, Bailin Deng

Abstract: Anderson acceleration (AA) is a popular method for accelerating fixed-point iterations, but may suffer from instability and stagnation. We propose a globalization method for AA to improve stability and achieve unified global and local convergence. Unlike existing AA globalization approaches that rely on safeguarding operations and might hinder fast local convergence, we adopt a nonmonotone trust-region framework and introduce an adaptive quadratic regularization together with a tailored acceptance mechanism. We prove global convergence and show that our algorithm attains the same local convergence as AA under appropriate assumptions. The effectiveness of our method is demonstrated in several numerical experiments.

Submitted 2 May, 2023; v1 submitted 3 June, 2020; originally announced June 2020.

Comments: Accepted to Journal of Scientific Computing

arXiv:2002.08513 [pdf, ps, other]

A Trust-Region Method For Nonsmooth Nonconvex Optimization

Authors: Ziang Chen, Andre Milzarek, Zaiwen Wen

Abstract: We propose a trust-region type method for a class of nonsmooth nonconvex optimization problems where the objective function is a summation of a (possibly nonconvex) smooth function and a (possibly nonsmooth) convex function. The model function of our trust-region subproblem is always quadratic and the linear term of the model is generated using abstract descent directions. Therefore, the trust-region subproblems can be easily constructed as well as efficiently solved by cheap and standard methods. When the accuracy of the model function at the solution of the subproblem is not sufficient, we add a safeguard on the stepsizes for improving the accuracy. For a class of functions that can be "truncated", an additional truncation step is defined and a stepsize modification strategy is designed. The overall scheme converges globally and we establish fast local convergence under suitable assumptions. In particular, using a connection with a smooth Riemannian trust-region method, we prove local quadratic convergence for partly smooth functions under a strict complementary condition. Preliminary numerical results on a family of $\ell_1$-optimization problems are reported and demonstrate the efficiency of our approach.

Submitted 23 October, 2021; v1 submitted 19 February, 2020; originally announced February 2020.

arXiv:1910.09373 [pdf, ps, other]

A Stochastic Extra-Step Quasi-Newton Method for Nonsmooth Nonconvex Optimization

Authors: Minghan Yang, Andre Milzarek, Zaiwen Wen, Tong Zhang

Abstract: In this paper, a novel stochastic extra-step quasi-Newton method is developed to solve a class of nonsmooth nonconvex composite optimization problems. We assume that the gradient of the smooth part of the objective function can only be approximated by stochastic oracles. The proposed method combines general stochastic higher order steps derived from an underlying proximal type fixed-point equation with additional stochastic proximal gradient steps to guarantee convergence. Based on suitable bounds on the step sizes, we establish global convergence to stationary points in expectation, and an extension of the approach using variance reduction techniques is discussed. Motivated by large-scale and big data applications, we investigate a stochastic coordinate-type quasi-Newton scheme that allows one to generate cheap and tractable stochastic higher order directions. Finally, the proposed algorithm is tested on large-scale logistic regression and deep learning problems and it is shown that it compares favorably with other state-of-the-art methods.

Submitted 21 October, 2019; originally announced October 2019.

Comments: 41 pages

MSC Class: 90C06; 90C15; 90C26; 90C53

arXiv:1908.00745

On The Geometric Analysis of A Quartic-quadratic Optimization Problem under A Spherical Constraint

Authors: Haixiang Zhang, Andre Milzarek, Zaiwen Wen, Wotao Yin

Abstract: This paper considers the problem of solving a special quartic-quadratic optimization problem with a single sphere constraint, namely, finding a global and local minimizer of $\frac{1}{2}\mathbf{z}^{*}A\mathbf{z}+\frac{β}{2}\sum_{k=1}^{n}\lvert z_{k}\rvert^{4}$ such that $\lVert\mathbf{z}\rVert_{2}=1$. This problem spans multiple domains including quantum mechanics and chemistry sciences and we investigate the geometric properties of this optimization problem. Fourth-order optimality conditions are derived for characterizing local and global minima. When the matrix in the quadratic term is diagonal, the problem has no spurious local minima and global solutions can be represented explicitly and calculated in $O(n\log{n})$ operations. When $A$ is a rank one matrix, the global minima of the problem are unique under certain phase shift schemes. The strict-saddle property, which can imply polynomial time convergence of second-order-type algorithms, is established when the coefficient $β$ of the quartic term is either at least $O(n^{3/2})$ or not larger than $O(1)$. Finally, the Kurdyka-Lojasiewicz exponent of quartic-quadratic problem is estimated and it is shown that the exponent is ${1}/{4}$ for a broad class of stationary points.

Submitted 2 August, 2019; originally announced August 2019.

arXiv:1803.03466

A Stochastic Semismooth Newton Method for Nonsmooth Nonconvex Optimization

Authors: Andre Milzarek, Xiantao Xiao, Shicong Cen, Zaiwen Wen, Michael Ulbrich

Abstract: In this work, we present a globalized stochastic semismooth Newton method for solving stochastic optimization problems involving smooth nonconvex and nonsmooth convex terms in the objective function. We assume that only noisy gradient and Hessian information of the smooth part of the objective function is available via calling stochastic first and second order oracles. The proposed method can be seen as a hybrid approach combining stochastic semismooth Newton steps and stochastic proximal gradient steps. Two inexact growth conditions are incorporated to monitor the convergence and the acceptance of the semismooth Newton steps and it is shown that the algorithm converges globally to stationary points in expectation. Moreover, under standard assumptions and utilizing random matrix concentration inequalities, we prove that the proposed approach locally turns into a pure stochastic semismooth Newton method and converges r-superlinearly with high probability. We present numerical results and comparisons on $\ell_1$-regularized logistic regression and nonconvex binary classification that demonstrate the efficiency of our algorithm.

Submitted 9 March, 2018; originally announced March 2018.

MSC Class: 49M15; 65C60; 65K05; 90C06

arXiv:1708.02016

Adaptive Regularized Newton Method for Riemannian Optimization

Authors: Jiang Hu, Andre Milzarek, Zaiwen Wen, Yaxiang Yuan

Abstract: Optimization on Riemannian manifolds widely arises in eigenvalue computation, density functional theory, Bose-Einstein condensates, low rank nearest correlation, image registration, and signal processing, etc. We propose an adaptive regularized Newton method which approximates the original objective function by the second-order Taylor expansion in Euclidean space but keeps the Riemannian manifold constraints. The regularization term in the objective function of the subproblem enables us to establish a Cauchy-point like condition as the standard trust-region method for proving global convergence. The subproblem can be solved inexactly either by first-order methods or a modified Riemannian Newton method. In the latter case, it can further take advantage of negative curvature directions. Both global convergence and superlinear local convergence are guaranteed under mild conditions. Extensive computational experiments and comparisons with other state-of-the-art methods indicate that the proposed algorithm is very promising.

Submitted 7 August, 2017; originally announced August 2017.

Showing 1–21 of 21 results for author: Milzarek, A