-
Non-convex Min-Max Optimization: Applications, Challenges, and Recent Theoretical Advances
Authors:
Meisam Razaviyayn,
Tianjian Huang,
Songtao Lu,
Maher Nouiehed,
Maziar Sanjabi,
Mingyi Hong
Abstract:
The min-max optimization problem, also known as the saddle point problem, is a classical optimization problem which is also studied in the context of zero-sum games. Given a class of objective functions, the goal is to find a value for the argument which leads to a small objective value even for the worst case function in the given class. Min-max optimization problems have recently become very popular in a wide range of signal and data processing applications such as fair beamforming, training generative adversarial networks (GANs), and robust machine learning, to name just a few. The overarching goal of this article is to provide a survey of recent advances for an important subclass of min-max problems, where the minimization and maximization problems can be non-convex and/or non-concave. In particular, we will first present a number of applications to showcase the importance of such min-max problems; then we discuss key theoretical challenges, and provide a selective review of some exciting recent theoretical and algorithmic advances in tackling non-convex min-max problems. Finally, we will point out open questions and future research directions.
Submitted 18 August, 2020; v1 submitted 15 June, 2020;
originally announced June 2020.
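For reference, the problem class surveyed in the entry above is commonly written in the following generic form. The symbols f, x, y and the feasible sets are notation assumed here for illustration and are not taken from the listing itself.

```latex
\min_{x \in \mathcal{X}} \; \max_{y \in \mathcal{Y}} \; f(x, y)
```

In the non-convex setting the abstract refers to, f(·, y) need not be convex in x and f(x, ·) need not be concave in y.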
-
A Trust Region Method for Finding Second-Order Stationarity in Linearly Constrained Non-Convex Optimization
Authors:
Maher Nouiehed,
Meisam Razaviyayn
Abstract:
Motivated by the TRACE algorithm [Curtis et al. 2017], we propose a trust region algorithm for finding second order stationary points of a linearly constrained non-convex optimization problem. We show the convergence of the proposed algorithm to (ε_g, ε_H)-second order stationary points in \widetilde{\mathcal{O}}(\max\{ε_g^{-3/2}, ε_H^{-3}\}) iterations. This iteration complexity is achieved for general linearly constrained optimization without cubic regularization of the objective function.
Submitted 14 April, 2019;
originally announced April 2019.
-
Solving a Class of Non-Convex Min-Max Games Using Iterative First Order Methods
Authors:
Maher Nouiehed,
Maziar Sanjabi,
Tianjian Huang,
Jason D. Lee,
Meisam Razaviyayn
Abstract:
Recent applications that arise in machine learning have sparked significant interest in solving min-max saddle point games. This problem has been extensively studied in the convex-concave regime for which a global equilibrium solution can be computed efficiently. In this paper, we study the problem in the non-convex regime and show that an \varepsilon-first order stationary point of the game can be computed when the objective of one of the players can be optimized to global optimality efficiently. In particular, we first consider the case where the objective of one of the players satisfies the Polyak-Łojasiewicz (PL) condition. For such a game, we show that a simple multi-step gradient descent-ascent algorithm finds an \varepsilon-first order stationary point of the problem in \widetilde{\mathcal{O}}(\varepsilon^{-2}) iterations. Then we show that our framework can also be applied to the case where the objective of the "max-player" is concave. In this case, we propose a multi-step gradient descent-ascent algorithm that finds an \varepsilon-first order stationary point of the game in \widetilde{\mathcal{O}}(\varepsilon^{-3.5}) iterations, which is the best known rate in the literature. We applied our algorithm to a fair classification problem on the Fashion-MNIST dataset and observed that the proposed algorithm results in smoother training and better generalization.
Submitted 30 October, 2019; v1 submitted 21 February, 2019;
originally announced February 2019.
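The multi-step gradient descent-ascent idea mentioned in the entry above can be illustrated with a minimal sketch. This is not the authors' algorithm or step-size schedule: the toy objective f(x, y) = y·sin(x) − y²/2, the function names, and all parameter values are assumptions chosen for illustration. The toy objective is strongly concave in y (so the max-player's problem is easy) and non-convex in x, with max over y equal to g(x) = sin²(x)/2.

```python
import math

def grad_x(x, y):
    # Partial derivative in x of the toy objective f(x, y) = y*sin(x) - 0.5*y**2.
    return y * math.cos(x)

def grad_y(x, y):
    # Partial derivative in y of the same toy objective.
    return math.sin(x) - y

def multi_step_gda(x0, y0, outer_iters=500, inner_steps=10, alpha=0.1, beta=0.5):
    """Several ascent steps on y per single descent step on x (illustrative values)."""
    x, y = x0, y0
    for _ in range(outer_iters):
        # Inner loop: approximately solve the max-player's problem for the current x.
        for _ in range(inner_steps):
            y += beta * grad_y(x, y)
        # Outer step: one gradient descent step for the min-player.
        x -= alpha * grad_x(x, y)
    return x, y

x_hat, y_hat = multi_step_gda(1.0, 0.0)
# Stationarity measure of g(x) = max_y f(x, y): g'(x) = sin(x) * cos(x).
stationarity = abs(math.sin(x_hat) * math.cos(x_hat))
```

Running several inner ascent steps keeps y close to the maximizer for the current x, so the outer step approximately follows the gradient of g.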
-
Convergence to Second-Order Stationarity for Constrained Non-Convex Optimization
Authors:
Maher Nouiehed,
Jason D. Lee,
Meisam Razaviyayn
Abstract:
We consider the problem of finding an approximate second-order stationary point of a constrained non-convex optimization problem. We first show that, unlike the gradient descent method for unconstrained optimization, the vanilla projected gradient descent algorithm may converge to a strict saddle point even when there is only a single linear constraint. We then provide a hardness result by showing that checking $(ε_g,ε_H)$-second order stationarity is NP-hard even in the presence of linear constraints. Despite our hardness result, we identify instances of the problem for which checking second order stationarity can be done efficiently. For such instances, we propose a dynamic second order Frank--Wolfe algorithm which converges to ($ε_g, ε_H$)-second order stationary points in ${\mathcal{O}}(\max\{ε_g^{-2}, ε_H^{-3}\})$ iterations. The proposed algorithm can be used in general constrained non-convex optimization as long as the constrained quadratic sub-problem can be solved efficiently.
Submitted 2 June, 2020; v1 submitted 3 October, 2018;
originally announced October 2018.
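To make the notion of (ε_g, ε_H)-second order stationarity in the entry above concrete, here is a minimal sketch of the check in the simpler unconstrained two-dimensional case, where it reduces to a small gradient norm and a Hessian whose smallest eigenvalue is at least −ε_H. The constrained definition used in the paper differs and is not reproduced here; the function names and test functions are assumptions for illustration.

```python
import math

def min_eig_sym_2x2(a, b, c):
    """Smallest eigenvalue of the symmetric matrix [[a, b], [b, c]]."""
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    return mean - radius

def is_second_order_stationary(grad, hess, eps_g, eps_h):
    """Unconstrained check: ||grad|| <= eps_g and lambda_min(hess) >= -eps_h."""
    grad_norm = math.hypot(grad[0], grad[1])
    lam_min = min_eig_sym_2x2(hess[0][0], hess[0][1], hess[1][1])
    return grad_norm <= eps_g and lam_min >= -eps_h

# f(x, y) = x**2 - y**2 at the origin: zero gradient, Hessian diag(2, -2).
saddle_ok = is_second_order_stationary((0.0, 0.0), ((2.0, 0.0), (0.0, -2.0)), 1e-3, 1e-3)

# f(x, y) = x**2 + y**2 at the origin: zero gradient, Hessian diag(2, 2).
minimum_ok = is_second_order_stationary((0.0, 0.0), ((2.0, 0.0), (0.0, 2.0)), 1e-3, 1e-3)
```

The first point is first-order stationary but fails the Hessian test, which is the strict saddle behaviour the abstract is concerned with; the second passes both tests.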
-
Learning Deep Models: Critical Points and Local Openness
Authors:
Maher Nouiehed,
Meisam Razaviyayn
Abstract:
With the increasing popularity of non-convex deep models, developing a unifying theory for studying the optimization problems that arise from training these models becomes very significant. Toward this end, we present in this paper a unifying landscape analysis framework that can be used when the training objective function is the composite of simple functions.
Using the local openness property of the underlying training models, we provide simple sufficient conditions under which any local optimum of the resulting optimization problem is globally optimal. We first completely characterize the local openness of the symmetric and non-symmetric matrix multiplication mapping. Then we use our characterization to: 1) Provide a simple proof for the classical result of Burer-Monteiro and extend it to non-continuous loss functions. 2) Show that every local optimum of two layer linear networks is globally optimal. Unlike many existing results in the literature, our result requires no assumption on the target data matrix Y, and input data matrix X. 3) Develop a complete characterization of the local/global optima equivalence of multi-layer linear neural networks. We provide various counterexamples to show the necessity of each of our assumptions. 4) Show global/local optima equivalence of over-parameterized non-linear deep models having a certain pyramidal structure. In contrast to existing works, our result requires no assumption on the differentiability of the activation functions and can go beyond "full-rank" cases.
Submitted 4 August, 2023; v1 submitted 8 March, 2018;
originally announced March 2018.
-
On the Pervasiveness of Difference-Convexity in Optimization and Statistics
Authors:
Maher Nouiehed,
Jong-Shi Pang,
Meisam Razaviyayn
Abstract:
With the increasing interest in applying the methodology of difference-of-convex (dc) optimization to diverse problems in engineering and statistics, this paper establishes the dc property of many well-known functions not previously known to be of this class. Motivated by a quadratic programming based recourse function in two-stage stochastic programming, we show that the (optimal) value function of a copositive (thus not necessarily convex) quadratic program is dc on the domain of finiteness of the program when the matrix in the objective function's quadratic term and the constraint matrix are fixed. The proof of this result is based on a dc decomposition of piecewise LC1 functions (i.e., functions with Lipschitz gradients). Armed with these new results and known properties of dc functions existing in the literature, we show that many composite statistical functions in risk analysis, including the value-at-risk (VaR), conditional value-at-risk (CVaR), expectation-based, VaR-based, and CVaR-based random deviation functions are all dc. Adding the known class of dc surrogate sparsity functions that are employed as approximations of the l_0 function in statistical learning, our work significantly expands the family of dc functions and positions them for fruitful applications.
Submitted 19 February, 2019; v1 submitted 11 April, 2017;
originally announced April 2017.
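As background for the dc decompositions mentioned in the entry above, the standard decomposition for a (non-piecewise) function whose gradient is Lipschitz on a convex domain is shown below. This is the textbook construction, not the paper's piecewise argument; the constant L (a Lipschitz constant of the gradient, with respect to the Euclidean norm) is notation assumed here.

```latex
f(x) \;=\; \underbrace{\Bigl( f(x) + \tfrac{L}{2}\,\|x\|_2^2 \Bigr)}_{\text{convex}}
\;-\; \underbrace{\tfrac{L}{2}\,\|x\|_2^2}_{\text{convex}}
```

The first term is convex because an L-Lipschitz gradient gives ⟨∇f(x) − ∇f(y), x − y⟩ ≥ −L‖x − y‖², so adding L‖x‖²/2 makes the gradient monotone.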