
Incentive Systems for Fleets of New Mobility Services

Authors: Ali Ghafelebashi, Meisam Razaviyayn, Maged Dessouky

Abstract: Traffic congestion has become an inevitable challenge in large cities due to population increases and expansion of urban areas. Various approaches are introduced to mitigate traffic issues, ranging from expanding the road infrastructure to employing demand management. Congestion pricing and incentive schemes are extensively studied for traffic control in traditional networks where each driver is a network "player". In this setup, drivers' "selfish" behavior hinders the network from reaching a socially optimal state. In future mobility services, on the other hand, a large portion of drivers/vehicles may be controlled by a small number of companies/organizations. In such a system, offering incentives to organizations can potentially be much more effective in reducing traffic congestion than offering incentives directly to drivers. This paper studies the problem of offering incentives to organizations to change the behavior of their individual drivers (or individuals relying on the organization's services). We developed a model where incentives are offered to each organization based on the aggregated travel time loss across all drivers in that organization. Such an incentive offering mechanism requires solving a large-scale optimization problem to minimize the system-level travel time. We propose an efficient algorithm for solving this optimization problem. Numerous experiments on Los Angeles County traffic data reveal the ability of our method to reduce system-level travel time by up to 6.9%. Moreover, our experiments demonstrate that incentivizing organizations can be up to 8 times more efficient than incentivizing individual drivers in terms of incentivization monetary cost.

Submitted 4 December, 2023; originally announced December 2023.

Comments: 23 pages, 12 figures. arXiv admin note: text overlap with arXiv:2204.07306

arXiv:2306.15056 [pdf, other]

Optimal Differentially Private Model Training with Public Data

Authors: Andrew Lowy, Zeman Li, Tianjian Huang, Meisam Razaviyayn

Abstract: Differential privacy (DP) ensures that training a machine learning model does not leak private data. In practice, we may have access to auxiliary public data that is free of privacy concerns. In this work, we assume access to a given amount of public data and settle the following fundamental open questions: 1. What is the optimal (worst-case) error of a DP model trained over a private data set while having access to side public data? 2. How can we harness public data to improve DP model training in practice? We consider these questions in both the local and central models of pure and approximate DP. To answer the first question, we prove tight (up to log factors) lower and upper bounds that characterize the optimal error rates of three fundamental problems: mean estimation, empirical risk minimization, and stochastic convex optimization. We show that the optimal error rates can be attained (up to log factors) by either discarding private data and training a public model, or treating public data like it is private and using an optimal DP algorithm. To address the second question, we develop novel algorithms that are "even more optimal" (i.e. better constants) than the asymptotically optimal approaches described above. For local DP mean estimation, our algorithm is optimal including constants. Empirically, our algorithms show benefits over the state-of-the-art.

Submitted 13 February, 2024; v1 submitted 26 June, 2023; originally announced June 2023.

Comments: V2 changed the title and added high-dimensional approximate semi-DP lower bounds

arXiv:2303.08431 [pdf, other]

Policy Gradient Converges to the Globally Optimal Policy for Nearly Linear-Quadratic Regulators

Authors: Yinbin Han, Meisam Razaviyayn, Renyuan Xu

Abstract: Nonlinear control systems with partial information to the decision maker are prevalent in a variety of applications. As a step toward studying such nonlinear systems, this work explores reinforcement learning methods for finding the optimal policy in the nearly linear-quadratic regulator systems. In particular, we consider a dynamic system that combines linear and nonlinear components, and is governed by a policy with the same structure. Assuming that the nonlinear component comprises kernels with small Lipschitz coefficients, we characterize the optimization landscape of the cost function. Although the cost function is nonconvex in general, we establish the local strong convexity and smoothness in the vicinity of the global optimizer. Additionally, we propose an initialization mechanism to leverage these properties. Building on the developments, we design a policy gradient algorithm that is guaranteed to converge to the globally optimal policy with a linear rate.

Submitted 16 February, 2024; v1 submitted 15 March, 2023; originally announced March 2023.

Comments: 34 pages

arXiv:2209.11920 [pdf, other]

Tradeoffs between convergence rate and noise amplification for momentum-based accelerated optimization algorithms

Authors: Hesameddin Mohammadi, Meisam Razaviyayn, Mihailo R. Jovanović

Abstract: We study momentum-based first-order optimization algorithms in which the iterations utilize information from the two previous steps and are subject to an additive white noise. This setup uses noise to account for uncertainty in either gradient evaluation or iteration updates, and it includes Polyak's heavy-ball and Nesterov's accelerated methods as special cases. For strongly convex quadratic problems, we use the steady-state variance of the error in the optimization variable to quantify noise amplification and identify fundamental stochastic performance tradeoffs. Our approach utilizes the Jury stability criterion to provide a novel geometric characterization of conditions for linear convergence, and it reveals the relation between the noise amplification and convergence rate as well as their dependence on the condition number and the constant algorithmic parameters. This geometric insight leads to simple alternative proofs of standard convergence results and allows us to establish an "uncertainty principle" of strongly convex optimization: for the two-step momentum method with linear convergence rate, the lower bound on the product between the settling time and noise amplification scales quadratically with the condition number. Our analysis also identifies a key difference between the gradient and iterate noise models: while the amplification of gradient noise can be made arbitrarily small by sufficiently decelerating the algorithm, the best achievable variance for the iterate noise model increases linearly with the settling time in the decelerating regime. Finally, we introduce two parameterized families of algorithms that strike a balance between noise amplification and settling time while preserving order-wise Pareto optimality for both noise models.

Submitted 19 June, 2024; v1 submitted 24 September, 2022; originally announced September 2022.

Comments: 23 pages; 7 figures
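
Illustrative sketch (not taken from the paper): the abstract above describes a two-step momentum iteration with additive noise that contains heavy-ball and Nesterov's method as special cases. The parameterization below, with step size `alpha`, momentum `beta`, and extrapolation weight `gamma`, is a common way to write such a family and is an assumption here, as are all function names, the toy diagonal quadratic, and the choice to add the noise directly to the iterate. Setting `gamma = 0` gives heavy-ball and `gamma = beta` gives a Nesterov-style update.

```python
import random

def two_step_momentum(grad, x0, alpha, beta, gamma, steps, sigma=0.0, seed=0):
    """Run x+ = x + beta*(x - x_prev) - alpha*grad(x + gamma*(x - x_prev)) + noise.

    sigma scales additive Gaussian noise on the iterate (an assumed noise model).
    """
    rng = random.Random(seed)
    x_prev = list(x0)
    x = list(x0)
    for _ in range(steps):
        # Point at which the gradient is evaluated (extrapolated if gamma > 0).
        y = [xi + gamma * (xi - pi) for xi, pi in zip(x, x_prev)]
        g = grad(y)
        x_next = [
            xi + beta * (xi - pi) - alpha * gi + sigma * rng.gauss(0.0, 1.0)
            for xi, pi, gi in zip(x, x_prev, g)
        ]
        x_prev, x = x, x_next
    return x

# Hypothetical test problem: f(x) = 0.5 * sum(q_i * x_i^2), eigenvalues in [m, L].
EIGS = [1.0, 4.0, 10.0]

def quad_grad(x):
    return [q * xi for q, xi in zip(EIGS, x)]

def heavy_ball_params(m, L):
    """Classical heavy-ball tuning for a strongly convex quadratic."""
    sm, sL = m ** 0.5, L ** 0.5
    alpha = 4.0 / (sL + sm) ** 2
    beta = ((sL - sm) / (sL + sm)) ** 2
    return alpha, beta

def norm(x):
    return sum(v * v for v in x) ** 0.5

if __name__ == "__main__":
    a, b = heavy_ball_params(min(EIGS), max(EIGS))
    clean = two_step_momentum(quad_grad, [1.0, 1.0, 1.0], a, b, 0.0, 300)
    noisy = two_step_momentum(quad_grad, [1.0, 1.0, 1.0], a, b, 0.0, 300, sigma=0.1)
    print("noise-free error:", norm(clean))
    print("noisy error:     ", norm(noisy))
```

Without noise the iterate converges to the minimizer at the origin; with `sigma > 0` the error settles at a nonzero level, which is the kind of noise amplification the abstract quantifies through the steady-state variance.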

arXiv:2209.07403 [pdf, other]

Private Stochastic Optimization With Large Worst-Case Lipschitz Parameter: Optimal Rates for (Non-Smooth) Convex Losses and Extension to Non-Convex Losses

Authors: Andrew Lowy, Meisam Razaviyayn

Abstract: We study differentially private (DP) stochastic optimization (SO) with loss functions whose worst-case Lipschitz parameter over all data points may be extremely large. To date, the vast majority of work on DP SO assumes that the loss is uniformly Lipschitz continuous over data (i.e. stochastic gradients are uniformly bounded over all data points). While this assumption is convenient, it often leads to pessimistic excess risk bounds. In many practical problems, the worst-case (uniform) Lipschitz parameter of the loss over all data points may be extremely large due to outliers. In such cases, the error bounds for DP SO, which scale with the worst-case Lipschitz parameter of the loss, are vacuous. To address these limitations, this work provides near-optimal excess risk bounds that do not depend on the uniform Lipschitz parameter of the loss. Building on a recent line of work (Wang et al., 2020; Kamath et al., 2022), we assume that stochastic gradients have bounded $k$-th order moments for some $k \geq 2$. Compared with works on uniformly Lipschitz DP SO, our excess risk scales with the $k$-th moment bound instead of the uniform Lipschitz parameter of the loss, allowing for significantly faster rates in the presence of outliers and/or heavy-tailed data. For convex and strongly convex loss functions, we provide the first asymptotically optimal excess risk bounds (up to a logarithmic factor). In contrast to (Wang et al., 2020; Kamath et al., 2022), our bounds do not require the loss function to be differentiable/smooth. We also devise a linear-time algorithm for smooth losses that has excess risk that is tight in certain practical parameter regimes. Additionally, our work is the first to address non-convex non-uniformly Lipschitz loss functions satisfying the Proximal-PL inequality; this covers some practical machine learning models. Our Proximal-PL algorithm has near-optimal excess risk.

Submitted 27 October, 2023; v1 submitted 15 September, 2022; originally announced September 2022.

Comments: Appeared in the International Conference on Algorithmic Learning Theory (ALT) 2023. This version improves the runtime bound in Theorem 6

arXiv:2204.07306 [pdf, other]

Congestion Reduction via Personalized Incentives

Authors: Ali Ghafelebashi, Meisam Razaviyayn, Maged Dessouky

Abstract: With rapid population growth and urban development, traffic congestion has become an inescapable issue, especially in large cities. Many congestion reduction strategies have been proposed in the past, ranging from roadway extension to transportation demand management. In particular, congestion pricing schemes have been used as negative reinforcements for traffic control. In this project, we study an alternative approach of offering positive incentives to drivers to take different routes. More specifically, we propose an algorithm to reduce traffic congestion and improve routing efficiency via offering personalized incentives to drivers. We exploit the wide accessibility of smart devices to communicate with drivers and develop an incentive offering mechanism using individuals' preferences and aggregate traffic information. The incentives are offered after solving a large-scale optimization problem in order to minimize the total travel time (or minimize any cost function of the network such as total carbon emission). Since this massive size optimization problem needs to be solved continually in the network, we developed a distributed computational approach. The proposed distributed algorithm is guaranteed to converge under a mild set of assumptions that are verified with real data. We evaluated the performance of our algorithm using traffic data from the Los Angeles area. Our experiments show congestion reduction of up to 11% in arterial roads and highways.

Submitted 14 April, 2022; originally announced April 2022.

Comments: 24 pages, 8 figures

arXiv:2203.06735 [pdf, other]

Private Non-Convex Federated Learning Without a Trusted Server

Authors: Andrew Lowy, Ali Ghafelebashi, Meisam Razaviyayn

Abstract: We study federated learning (FL) -- especially cross-silo FL -- with non-convex loss functions and data from people who do not trust the server or other silos. In this setting, each silo (e.g. hospital) must protect the privacy of each person's data (e.g. patient's medical record), even if the server or other silos act as adversarial eavesdroppers. To that end, we consider inter-silo record-level (ISRL) differential privacy (DP), which requires silo $i$'s communications to satisfy record/item-level DP. We propose novel ISRL-DP algorithms for FL with heterogeneous (non-i.i.d.) silo data and two classes of Lipschitz continuous loss functions: First, we consider losses satisfying the Proximal Polyak-Lojasiewicz (PL) inequality, which is an extension of the classical PL condition to the constrained setting. In contrast to our result, prior works only considered unconstrained private optimization with Lipschitz PL loss, which rules out most interesting PL losses such as strongly convex problems and linear/logistic regression. Our algorithms nearly attain the optimal strongly convex, homogeneous (i.i.d.) rate for ISRL-DP FL without assuming convexity or i.i.d. data. Second, we give the first private algorithms for non-convex non-smooth loss functions. Our utility bounds even improve on the state-of-the-art bounds for smooth losses. We complement our upper bounds with lower bounds. Additionally, we provide shuffle DP (SDP) algorithms that improve over the state-of-the-art central DP algorithms under more practical trust assumptions. Numerical experiments show that our algorithm has better accuracy than baselines for most privacy levels. The code is publicly available at: https://github.com/ghafeleb/Private-NonConvex-Federated-Learning-Without-a-Trusted-Server.

Submitted 25 June, 2023; v1 submitted 13 March, 2022; originally announced March 2022.

Comments: AISTATS 2023

arXiv:2110.03950 [pdf, ps, other]

Nonconvex-Nonconcave Min-Max Optimization with a Small Maximization Domain

Authors: Dmitrii M. Ostrovskii, Babak Barazandeh, Meisam Razaviyayn

Abstract: We study the problem of finding approximate first-order stationary points in optimization problems of the form $\min_{x \in X} \max_{y \in Y} f(x,y)$, where the sets $X,Y$ are convex and $Y$ is compact. The objective function $f$ is smooth, but assumed neither convex in $x$ nor concave in $y$. Our approach relies upon replacing the function $f(x,\cdot)$ with its $k$th order Taylor approximation (in $y$) and finding a near-stationary point in the resulting surrogate problem. To guarantee its success, we establish the following result: let the Euclidean diameter of $Y$ be small in terms of the target accuracy $\varepsilon$, namely $O(\varepsilon^{\frac{2}{k+1}})$ for $k \in \mathbb{N}$ and $O(\varepsilon)$ for $k = 0$, with the constant factors controlled by certain regularity parameters of $f$; then any $\varepsilon$-stationary point in the surrogate problem remains $O(\varepsilon)$-stationary for the initial problem. Moreover, we show that these upper bounds are nearly optimal: the aforementioned reduction provably fails when the diameter of $Y$ is larger. For $0 \le k \le 2$ the surrogate function can be efficiently maximized in $y$; our general approximation result then leads to efficient algorithms for finding a near-stationary point in nonconvex-nonconcave min-max problems, for which we also provide convergence guarantees.

Submitted 8 October, 2021; originally announced October 2021.

Comments: 50 pages

arXiv:2106.09779 [pdf, other]

Private Federated Learning Without a Trusted Server: Optimal Algorithms for Convex Losses

Authors: Andrew Lowy, Meisam Razaviyayn

Abstract: This paper studies federated learning (FL)--especially cross-silo FL--with data from people who do not trust the server or other silos. In this setting, each silo (e.g. hospital) has data from different people (e.g. patients) and must maintain the privacy of each person's data (e.g. medical record), even if the server or other silos act as adversarial eavesdroppers. This requirement motivates the study of Inter-Silo Record-Level Differential Privacy (ISRL-DP), which requires silo i's communications to satisfy record/item-level differential privacy (DP). ISRL-DP ensures that the data of each person (e.g. patient) in silo i (e.g. hospital i) cannot be leaked. ISRL-DP is different from well-studied privacy notions. Central and user-level DP assume that people trust the server/other silos. On the other end of the spectrum, local DP assumes that people do not trust anyone at all (even their own silo). Sitting between central and local DP, ISRL-DP makes the realistic assumption (in cross-silo FL) that people trust their own silo, but not the server or other silos. In this work, we provide tight (up to logarithms) upper and lower bounds for ISRL-DP FL with convex/strongly convex loss functions and homogeneous (i.i.d.) silo data. Remarkably, we show that similar bounds are attainable for smooth losses with arbitrary heterogeneous silo data distributions, via an accelerated ISRL-DP algorithm. We also provide tight upper and lower bounds for ISRL-DP federated empirical risk minimization, and use acceleration to attain the optimal bounds in fewer rounds of communication than the state-of-the-art. Finally, with a secure "shuffler" to anonymize silo messages (but without a trusted server), our algorithm attains the optimal central DP rates under more practical trust assumptions. Numerical experiments show favorable privacy-accuracy tradeoffs for our algorithm in classification and regression tasks.

Submitted 14 June, 2023; v1 submitted 17 June, 2021; originally announced June 2021.

Comments: ICLR 2023

arXiv:2105.05953 [pdf, other]

Efficient Algorithms for Estimating the Parameters of Mixed Linear Regression Models

Authors: Babak Barazandeh, Ali Ghafelebashi, Meisam Razaviyayn, Ram Sriharsha

Abstract: The mixed linear regression (MLR) model is among the most exemplary statistical tools for modeling non-linear distributions using a mixture of linear models. When the additive noise in the MLR model is Gaussian, the Expectation-Maximization (EM) algorithm is a widely used algorithm for maximum likelihood estimation of MLR parameters. However, when the noise is non-Gaussian, the steps of the EM algorithm may not have closed-form update rules, which makes the EM algorithm impractical. In this work, we study the maximum likelihood estimation of the parameters of the MLR model when the additive noise has a non-Gaussian distribution. In particular, we consider the case where the noise has a Laplacian distribution, and we first show that, unlike the Gaussian case, the resulting sub-problems of the EM algorithm do not have closed-form update rules, thus preventing us from using EM in this case. To overcome this issue, we propose a new algorithm based on combining the alternating direction method of multipliers (ADMM) with the EM algorithm idea. Our numerical experiments show that our method outperforms the EM algorithm in statistical accuracy and computational time in the non-Gaussian noise case.

Submitted 12 May, 2021; originally announced May 2021.

arXiv:2012.02901 [pdf, other]

Near-Optimal Procedures for Model Discrimination with Non-Disclosure Properties

Authors: Dmitrii M. Ostrovskii, Mohamed Ndaoud, Adel Javanmard, Meisam Razaviyayn

Abstract: Let $θ_0,θ_1 \in \mathbb{R}^d$ be the population risk minimizers associated to some loss $\ell:\mathbb{R}^d\times \mathcal{Z}\to\mathbb{R}$ and two distributions $\mathbb{P}_0,\mathbb{P}_1$ on $\mathcal{Z}$. The models $θ_0,θ_1$ are unknown, and $\mathbb{P}_0,\mathbb{P}_1$ can be accessed by drawing i.i.d. samples from them. Our work is motivated by the following model discrimination question: "What sizes of the samples from $\mathbb{P}_0$ and $\mathbb{P}_1$ allow to distinguish between the two hypotheses $θ^*=θ_0$ and $θ^*=θ_1$ for given $θ^*\in\{θ_0,θ_1\}$?" Making the first steps towards answering it in full generality, we first consider the case of a well-specified linear model with squared loss. Here we provide matching upper and lower bounds on the sample complexity as given by $\min\{1/Δ^2,\sqrt{r}/Δ\}$ up to a constant factor; here $Δ$ is a measure of separation between $\mathbb{P}_0$ and $\mathbb{P}_1$ and $r$ is the rank of the design covariance matrix. We then extend this result in two directions: (i) for general parametric models in asymptotic regime; (ii) for generalized linear models in small samples ($n\le r$) under weak moment assumptions. In both cases we derive sample complexity bounds of a similar form while allowing for model misspecification. In fact, our testing procedures only access $θ^*$ via a certain functional of empirical risk. In addition, the number of observations that allows us to reach statistical confidence does not allow to "resolve" the two models, that is, recover $θ_0,θ_1$ up to $O(Δ)$ prediction accuracy. These two properties allow to use our framework in applied tasks where one would like to $\textit{identify}$ a prediction model, which can be proprietary, while guaranteeing that the model cannot be actually $\textit{inferred}$ by the identifying agent.

Submitted 10 July, 2021; v1 submitted 4 December, 2020; originally announced December 2020.

Comments: 52 pages, 2 figures; corrected the proof of the lower bound; added new applications and the Fisher information-based argument in Appendix F

arXiv:2009.03482 [pdf, ps, other]

Alternating Direction Method of Multipliers for Quantization

Authors: Tianjian Huang, Prajwal Singhania, Maziar Sanjabi, Pabitra Mitra, Meisam Razaviyayn

Abstract: Quantization of the parameters of machine learning models, such as deep neural networks, requires solving constrained optimization problems, where the constraint set is formed by the Cartesian product of many simple discrete sets. For such optimization problems, we study the performance of the Alternating Direction Method of Multipliers for Quantization ($\texttt{ADMM-Q}$) algorithm, which is a variant of the widely-used ADMM method applied to our discrete optimization problem. We establish the convergence of the iterates of $\texttt{ADMM-Q}$ to certain $\textit{stationary points}$. To the best of our knowledge, this is the first analysis of an ADMM-type method for problems with discrete variables/constraints. Based on our theoretical insights, we develop a few variants of $\texttt{ADMM-Q}$ that can handle inexact update rules, and have improved performance via the use of "soft projection" and "injecting randomness to the algorithm". We empirically evaluate the efficacy of our proposed approaches.

Submitted 1 March, 2021; v1 submitted 7 September, 2020; originally announced September 2020.
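
Illustrative sketch (not the paper's ADMM-Q algorithm): the abstract above notes that the quantization constraint set is a Cartesian product of simple discrete sets. One consequence is that Euclidean projection onto such a set separates by coordinate, so each parameter is simply snapped to its nearest allowed level. The function name and the example levels below are hypothetical, and the ADMM updates themselves are not reproduced here.

```python
def project_onto_product(values, levels_per_coord):
    """Project each coordinate onto its own finite set of allowed levels.

    Because the constraint set is a Cartesian product, the nearest point in
    the product is found by taking the nearest level coordinate by coordinate.
    Ties are broken in favor of the level listed first.
    """
    return [
        min(levels, key=lambda q: abs(q - v))
        for v, levels in zip(values, levels_per_coord)
    ]

if __name__ == "__main__":
    ternary = [-1.0, 0.0, 1.0]  # assumed quantization levels for illustration
    weights = [0.3, -0.8, 0.1]
    print(project_onto_product(weights, [ternary] * len(weights)))
```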

arXiv:2006.08141 [pdf, other]

doi 10.1109/MSP.2020.3003851

Non-convex Min-Max Optimization: Applications, Challenges, and Recent Theoretical Advances

Authors: Meisam Razaviyayn, Tianjian Huang, Songtao Lu, Maher Nouiehed, Maziar Sanjabi, Mingyi Hong

Abstract: The min-max optimization problem, also known as the saddle point problem, is a classical optimization problem which is also studied in the context of zero-sum games. Given a class of objective functions, the goal is to find a value for the argument which leads to a small objective value even for the worst case function in the given class. Min-max optimization problems have recently become very popular in a wide range of signal and data processing applications such as fair beamforming, training generative adversarial networks (GANs), and robust machine learning, to name just a few. The overarching goal of this article is to provide a survey of recent advances for an important subclass of min-max problems, where the minimization and maximization problems can be non-convex and/or non-concave. In particular, we will first present a number of applications to showcase the importance of such min-max problems; then we discuss key theoretical challenges, and provide a selective review of some exciting recent theoretical and algorithmic advances in tackling non-convex min-max problems. Finally, we will point out open questions and future research directions.

Submitted 18 August, 2020; v1 submitted 15 June, 2020; originally announced June 2020.

Journal ref: IEEE Signal Processing Magazine (Volume: 37, Issue: 5, Sept. 2020)

arXiv:2003.08093 [pdf, other]

Solving Non-Convex Non-Differentiable Min-Max Games using Proximal Gradient Method

Authors: Babak Barazandeh, Meisam Razaviyayn

Abstract: Min-max saddle point games appear in a wide range of applications in machine learning and signal processing. Despite their wide applicability, theoretical studies are mostly limited to the special convex-concave structure. While some recent works generalized these results to special smooth non-convex cases, our understanding of non-smooth scenarios is still limited. In this work, we study a special form of non-smooth min-max games when the objective function is (strongly) convex with respect to one of the players' decision variables. We show that a simple multi-step proximal gradient descent-ascent algorithm converges to $ε$-first-order Nash equilibrium of the min-max game with the number of gradient evaluations being polynomial in $1/ε$. We will also show that our notion of stationarity is stronger than existing ones in the literature. Finally, we evaluate the performance of the proposed algorithm through adversarial attack on a LASSO estimator.

Submitted 18 March, 2020; originally announced March 2020.

arXiv:2002.07919 [pdf, other]

Efficient Search of First-Order Nash Equilibria in Nonconvex-Concave Smooth Min-Max Problems

Authors: Dmitrii M. Ostrovskii, Andrew Lowy, Meisam Razaviyayn

Abstract: We propose an efficient algorithm for finding first-order Nash equilibria in min-max problems of the form $\min_{x \in X}\max_{y\in Y} F(x,y)$, where the objective function is smooth in both variables and concave with respect to $y$; the sets $X$ and $Y$ are convex and "projection-friendly," and $Y$ is compact. Our goal is to find an $(\varepsilon_x,\varepsilon_y)$-first-order Nash equilibrium with respect to a stationarity criterion that is stronger than the commonly used proximal gradient norm. The proposed approach is fairly simple: we perform approximate proximal-point iterations on the primal function, with inexact oracle provided by Nesterov's algorithm run on the regularized function $F(x_t,\cdot)$, $x_t$ being the current primal iterate. The resulting iteration complexity is $O(\varepsilon_x{}^{-2} \varepsilon_y{}^{-1/2})$ up to a logarithmic factor. As a byproduct, the choice $\varepsilon_y = O(\varepsilon_x{}^2)$ allows for the $O(\varepsilon_x{}^{-3})$ complexity of finding an $\varepsilon_x$-stationary point for the standard Moreau envelope of the primal function. Moreover, when the objective is strongly concave with respect to $y$, the complexity estimate for our algorithm improves to $O(\varepsilon_x{}^{-2}{κ_y}^{1/2})$ up to a logarithmic factor, where $κ_y$ is the condition number appropriately adjusted for coupling. In both scenarios, the complexity estimates are the best known so far, and are only known for the (weaker) proximal gradient norm criterion. Meanwhile, our approach is "user-friendly:" (i) the algorithm is built upon running a variant of Nesterov's accelerated algorithm as subroutine and avoids extragradient steps; (ii) the convergence analysis recycles the well-known results on accelerated methods with inexact oracle. Finally, we extend the approach to non-Euclidean proximal geometries.

Submitted 2 May, 2021; v1 submitted 18 February, 2020; originally announced February 2020.

Comments: 29 pages; accepted to SIAM Journal on Optimization (as of May 2021)

MSC Class: 90C06; 90C25; 90C26; 91A99

arXiv:2001.07819 [pdf, other]

Zeroth-Order Algorithms for Nonconvex Minimax Problems with Improved Complexities

Authors: Zhongruo Wang, Krishnakumar Balasubramanian, Shiqian Ma, Meisam Razaviyayn

Abstract: In this paper, we study zeroth-order algorithms for minimax optimization problems that are nonconvex in one variable and strongly-concave in the other variable. Such minimax optimization problems have attracted significant attention lately due to their applications in modern machine learning tasks. We first consider a deterministic version of the problem. We design and analyze the Zeroth-Order Gradient Descent Ascent (\texttt{ZO-GDA}) algorithm, and provide improved results compared to existing works, in terms of oracle complexity. We also propose the Zeroth-Order Gradient Descent Multi-Step Ascent (\texttt{ZO-GDMSA}) algorithm that significantly improves the oracle complexity of \texttt{ZO-GDA}. We then consider stochastic versions of \texttt{ZO-GDA} and \texttt{ZO-GDMSA}, to handle stochastic nonconvex minimax problems. For this case, we provide oracle complexity results under two assumptions on the stochastic gradient: (i) the uniformly bounded variance assumption, which is common in traditional stochastic optimization, and (ii) the Strong Growth Condition (SGC), which has been known to be satisfied by modern over-parametrized machine learning models. We establish that under the SGC assumption, the complexities of the stochastic algorithms match those of the deterministic algorithms. Numerical experiments are presented to support our theoretical results.

Submitted 4 April, 2022; v1 submitted 21 January, 2020; originally announced January 2020.

Comments: To appear in the Journal of Global Optimization

arXiv:1907.04450 [pdf, ps, other]

SNAP: Finding Approximate Second-Order Stationary Solutions Efficiently for Non-convex Linearly Constrained Problems

Authors: Songtao Lu, Meisam Razaviyayn, Bo Yang, Kejun Huang, Mingyi Hong

Abstract: This paper proposes low-complexity algorithms for finding approximate second-order stationary points (SOSPs) of problems with smooth non-convex objective and linear constraints. While finding (approximate) SOSPs is computationally intractable, we first show that generic instances of the problem can be solved efficiently. More specifically, for a generic problem instance, a certain strict complementarity (SC) condition holds for all Karush-Kuhn-Tucker (KKT) solutions (with probability one). The SC condition is then used to establish an equivalence relationship between two different notions of SOSPs, one of which is computationally easy to verify. Based on this particular notion of SOSP, we design an algorithm named the Successive Negative-curvature grAdient Projection (SNAP), which successively performs either conventional gradient projection or some negative curvature based projection steps to find SOSPs. SNAP and its first-order extension SNAP$^+$ require $\mathcal{O}(1/\epsilon^{2.5})$ iterations to compute an $(\epsilon, \sqrt{\epsilon})$-SOSP, and their per-iteration computational complexities are polynomial in the number of constraints and problem dimension. To our knowledge, this is the first time that first-order algorithms with polynomial per-iteration complexity and global sublinear rate have been designed to find SOSPs of the important class of non-convex problems with linear constraints.

Submitted 9 July, 2019; originally announced July 2019.

arXiv:1905.11011 [pdf, other]

Robustness of accelerated first-order algorithms for strongly convex optimization problems

Authors: Hesameddin Mohammadi, Meisam Razaviyayn, Mihailo R. Jovanović

Abstract: We study the robustness of accelerated first-order algorithms to stochastic uncertainties in gradient evaluation. Specifically, for unconstrained, smooth, strongly convex optimization problems, we examine the mean-squared error in the optimization variable when the iterates are perturbed by additive white noise. This type of uncertainty may arise in situations where an approximation of the gradient is sought through measurements of a real system or in a distributed computation over a network. Even though the underlying dynamics of first-order algorithms for this class of problems are nonlinear, we establish upper bounds on the mean-squared deviation from the optimal solution that are tight up to constant factors. Our analysis quantifies fundamental trade-offs between noise amplification and convergence rates obtained via any acceleration scheme similar to Nesterov's or heavy-ball methods. To gain additional analytical insight, for strongly convex quadratic problems, we explicitly evaluate the steady-state variance of the optimization variable in terms of the eigenvalues of the Hessian of the objective function. We demonstrate that the entire spectrum of the Hessian, rather than just the extreme eigenvalues, influences the robustness of noisy algorithms. We specialize this result to the problem of distributed averaging over undirected networks and examine the role of network size and topology on the robustness of noisy accelerated algorithms.

Submitted 20 February, 2020; v1 submitted 27 May, 2019; originally announced May 2019.

Comments: 45 pages, 6 figures

arXiv:1904.06784 [pdf, ps, other]

A Trust Region Method for Finding Second-Order Stationarity in Linearly Constrained Non-Convex Optimization

Authors: Maher Nouiehed, Meisam Razaviyayn

Abstract: Motivated by the TRACE algorithm [Curtis et al. 2017], we propose a trust region algorithm for finding second order stationary points of a linearly constrained non-convex optimization problem. We show the convergence of the proposed algorithm to $(\epsilon_g, \epsilon_H)$-second order stationary points in $\widetilde{\mathcal{O}}(\max\{\epsilon_g^{-3/2}, \epsilon_H^{-3}\})$ iterations. This iteration complexity is achieved for general linearly constrained optimization without cubic regularization of the objective function.

Submitted 14 April, 2019; originally announced April 2019.

arXiv:1902.08297 [pdf, other]

Solving a Class of Non-Convex Min-Max Games Using Iterative First Order Methods

Authors: Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D. Lee, Meisam Razaviyayn

Abstract: Recent applications that arise in machine learning have sparked significant interest in solving min-max saddle point games. This problem has been extensively studied in the convex-concave regime for which a global equilibrium solution can be computed efficiently. In this paper, we study the problem in the non-convex regime and show that an $\varepsilon$--first order stationary point of the game can be computed when one of the players' objectives can be optimized to global optimality efficiently. In particular, we first consider the case where the objective of one of the players satisfies the Polyak-Łojasiewicz (PL) condition. For such a game, we show that a simple multi-step gradient descent-ascent algorithm finds an $\varepsilon$--first order stationary point of the problem in $\widetilde{\mathcal{O}}(\varepsilon^{-2})$ iterations. Then we show that our framework can also be applied to the case where the objective of the "max-player" is concave. In this case, we propose a multi-step gradient descent-ascent algorithm that finds an $\varepsilon$--first order stationary point of the game in $\widetilde{\mathcal{O}}(\varepsilon^{-3.5})$ iterations, which is the best known rate in the literature. We applied our algorithm to a fair classification problem on the Fashion-MNIST dataset and observed that the proposed algorithm results in smoother training and better generalization.

Submitted 30 October, 2019; v1 submitted 21 February, 2019; originally announced February 2019.

arXiv:1812.02878 [pdf, ps, other]

Solving Non-Convex Non-Concave Min-Max Games Under Polyak-Łojasiewicz Condition

Authors: Maziar Sanjabi, Meisam Razaviyayn, Jason D. Lee

Abstract: In this short note, we consider the problem of solving a min-max zero-sum game. This problem has been extensively studied in the convex-concave regime where the global solution can be computed efficiently. Recently, there have also been developments for finding the first order stationary points of the game when one of the players' objectives is concave or (weakly) concave. This work focuses on the non-convex non-concave regime where the objective of one of the players satisfies the Polyak-Łojasiewicz (PL) condition. For such a game, we show that a simple multi-step gradient descent-ascent algorithm finds an $\varepsilon$--first order stationary point of the problem in $\widetilde{\mathcal{O}}(\varepsilon^{-2})$ iterations.

Submitted 6 December, 2018; originally announced December 2018.
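The multi-step gradient descent-ascent idea named in the abstract above can be illustrated with a minimal sketch. Everything specific here is an assumption chosen for illustration and is not taken from the paper: the toy objective $f(x,y) = \sin(x)\,y - y^2/2$ (strongly concave in $y$, hence satisfying the PL condition for the max-player, and non-convex in $x$ after maximization), the step sizes, and the iteration counts.

```python
import math

# Toy objective (an assumption, not from the paper):
#   f(x, y) = sin(x) * y - 0.5 * y**2
# For fixed x, f is strongly concave in y, so the max-player's problem
# satisfies the PL condition; max_y f(x, y) = 0.5 * sin(x)**2 is non-convex.

def grad_x(x, y):
    """Partial derivative of f with respect to x."""
    return math.cos(x) * y

def grad_y(x, y):
    """Partial derivative of f with respect to y."""
    return math.sin(x) - y

def multistep_gda(x0, y0, eta_x=0.1, eta_y=0.5, inner_steps=10, outer_steps=500):
    """Run several ascent steps on y, then one descent step on x, repeatedly."""
    x, y = x0, y0
    for _ in range(outer_steps):
        for _ in range(inner_steps):
            y += eta_y * grad_y(x, y)   # inner loop: approximate max over y
        x -= eta_x * grad_x(x, y)       # outer step: descend in x
    return x, y

x_hat, y_hat = multistep_gda(1.0, 0.0)
print(abs(grad_x(x_hat, y_hat)), abs(grad_y(x_hat, y_hat)))
```

Both partial derivatives are close to zero at the returned point, which is the first-order stationarity notion the abstract refers to; the hypothetical step sizes were picked by hand for this toy problem only.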

arXiv:1810.05251 [pdf, other]

A Linearly Convergent Doubly Stochastic Gauss-Seidel Algorithm for Solving Linear Equations and A Certain Class of Over-Parameterized Optimization Problems

Authors: Meisam Razaviyayn, Mingyi Hong, Navid Reyhanian, Zhi-Quan Luo

Abstract: Consider the classical problem of solving a general linear system of equations $Ax=b$. It is well known that the (successively over relaxed) Gauss-Seidel scheme and many of its variants may not converge when $A$ is neither diagonally dominant nor symmetric positive definite. Can we have a linearly convergent G-S type algorithm that works for any $A$? In this paper we answer this question affirmatively by proposing a doubly stochastic G-S algorithm that is provably linearly convergent (in the mean square error sense) for any feasible linear system of equations. The key in the algorithm design is to introduce a nonuniform double stochastic scheme for picking the equation and the variable in each update step as well as a stepsize rule. These techniques also generalize to certain iterative alternating projection algorithms for solving the linear feasibility problem $A x\le b$ with an arbitrary $A$, as well as high-dimensional minimization problems for training over-parameterized models in machine learning. Our results demonstrate that a carefully designed randomization scheme can make an otherwise divergent G-S algorithm converge.

Submitted 13 May, 2019; v1 submitted 11 October, 2018; originally announced October 2018.

arXiv:1810.02024 [pdf, other]

Convergence to Second-Order Stationarity for Constrained Non-Convex Optimization

Authors: Maher Nouiehed, Jason D. Lee, Meisam Razaviyayn

Abstract: We consider the problem of finding an approximate second-order stationary point of a constrained non-convex optimization problem. We first show that, unlike the gradient descent method for unconstrained optimization, the vanilla projected gradient descent algorithm may converge to a strict saddle point even when there is only a single linear constraint. We then provide a hardness result by showing that checking $(\epsilon_g,\epsilon_H)$-second order stationarity is NP-hard even in the presence of linear constraints. Despite our hardness result, we identify instances of the problem for which checking second order stationarity can be done efficiently. For such instances, we propose a dynamic second order Frank--Wolfe algorithm which converges to $(\epsilon_g, \epsilon_H)$-second order stationary points in ${\mathcal{O}}(\max\{\epsilon_g^{-2}, \epsilon_H^{-3}\})$ iterations. The proposed algorithm can be used in general constrained non-convex optimization as long as the constrained quadratic sub-problem can be solved efficiently.

Submitted 2 June, 2020; v1 submitted 3 October, 2018; originally announced October 2018.

arXiv:1803.02968 [pdf, other]

Learning Deep Models: Critical Points and Local Openness

Authors: Maher Nouiehed, Meisam Razaviyayn

Abstract: With the increasing popularity of non-convex deep models, developing a unifying theory for studying the optimization problems that arise from training these models becomes very significant. Toward this end, we present in this paper a unifying landscape analysis framework that can be used when the training objective function is the composite of simple functions. Using the local openness property of the underlying training models, we provide simple sufficient conditions under which any local optimum of the resulting optimization problem is globally optimal. We first completely characterize the local openness of the symmetric and non-symmetric matrix multiplication mapping. Then we use our characterization to: 1) Provide a simple proof for the classical result of Burer-Monteiro and extend it to non-continuous loss functions. 2) Show that every local optimum of two layer linear networks is globally optimal. Unlike many existing results in the literature, our result requires no assumption on the target data matrix Y and the input data matrix X. 3) Develop a complete characterization of the local/global optima equivalence of multi-layer linear neural networks. We provide various counterexamples to show the necessity of each of our assumptions. 4) Show global/local optima equivalence of over-parameterized non-linear deep models having a certain pyramidal structure. In contrast to existing works, our result requires no assumption on the differentiability of the activation functions and can go beyond "full-rank" cases.

Submitted 4 August, 2023; v1 submitted 8 March, 2018; originally announced March 2018.

arXiv:1802.08941 [pdf, ps, other]

Gradient Primal-Dual Algorithm Converges to Second-Order Stationary Solutions for Nonconvex Distributed Optimization

Authors: Mingyi Hong, Jason D. Lee, Meisam Razaviyayn

Abstract: In this work, we study two first-order primal-dual based algorithms, the Gradient Primal-Dual Algorithm (GPDA) and the Gradient Alternating Direction Method of Multipliers (GADMM), for solving a class of linearly constrained non-convex optimization problems. We show that with random initialization of the primal and dual variables, both algorithms are able to compute second-order stationary solutions (ss2) with probability one. This is the first result showing that a primal-dual algorithm is capable of finding ss2 when only using first-order information; it also extends the existing results for first-order, but primal-only, algorithms. An important implication of our result is that it also gives rise to the first global convergence result to the ss2 for two classes of unconstrained distributed non-convex learning problems over multi-agent networks.

Submitted 24 February, 2018; originally announced February 2018.

arXiv:1802.08249 [pdf, other]

On the Convergence and Robustness of Training GANs with Regularized Optimal Transport

Authors: Maziar Sanjabi, Jimmy Ba, Meisam Razaviyayn, Jason D. Lee

Abstract: Generative Adversarial Networks (GANs) are one of the most practical methods for learning data distributions. A popular GAN formulation is based on the use of Wasserstein distance as a metric between probability distributions. Unfortunately, minimizing the Wasserstein distance between the data distribution and the generative model distribution is a computationally challenging problem as its objective is non-convex, non-smooth, and even hard to compute. In this work, we show that obtaining gradient information of the smoothed Wasserstein GAN formulation, which is based on regularized Optimal Transport (OT), is computationally effortless and hence one can apply first order optimization methods to minimize this objective. Consequently, we establish a theoretical convergence guarantee to stationarity for a proposed class of GAN optimization algorithms. Unlike the original non-smooth formulation, our algorithm only requires solving the discriminator to approximate optimality. We apply our method to learning MNIST digits as well as CIFAR-10 images. Our experiments show that our method is computationally efficient and generates images comparable to the state-of-the-art algorithms given the same architecture and computational power.

Submitted 22 May, 2018; v1 submitted 21 February, 2018; originally announced February 2018.

arXiv:1801.04005 [pdf, other]

Minimax Optimality of Sign Test for Paired Heterogeneous Data

Authors: Martin J. Zhang, Meisam Razaviyayn, David Tse

Abstract: Comparing two groups under different conditions is ubiquitous in the biomedical sciences. In many cases, samples from the two groups can be naturally paired; for example, a pair of samples may come from the same individual under the two conditions. However, samples across different individuals may be highly heterogeneous. Traditional methods often ignore such heterogeneity by assuming the samples are identically distributed. In this work, we study the problem of comparing paired heterogeneous data by modeling the data as Gaussian distributed with different parameters across the samples. We show that in the minimax setting where we want to maximize the worst-case power, the sign test, which only uses the signs of the differences between the paired samples, is optimal in the one-sided case and near optimal in the two-sided case. The superiority of the sign test over other popular tests for paired heterogeneous data is demonstrated using both synthetic data and a real-world RNA-Seq dataset.

Submitted 11 January, 2018; originally announced January 2018.
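For readers unfamiliar with the sign test mentioned in the abstract above, here is a minimal sketch of the textbook one-sided version. It is the standard exact binomial form, not the paper's minimax analysis; the function name and the sample data are hypothetical.

```python
from math import comb

def sign_test_one_sided(before, after):
    """Exact one-sided sign test for paired samples.

    Tests H0: P(after > before) = 1/2 against H1: P(after > before) > 1/2.
    Only the signs of the paired differences are used; ties are dropped.
    Returns (number of positive differences, number of non-tied pairs, p-value).
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    k = sum(d > 0 for d in diffs)
    # Under H0 the count of positive signs is Binomial(n, 1/2);
    # the p-value is the upper tail probability P(K >= k).
    p_value = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return k, n, p_value

# Hypothetical paired measurements (illustrative only).
before = [1, 2, 3, 4, 5, 6, 7, 8]
after = [2, 3, 4, 5, 6, 7, 8, 7]
k, n, p = sign_test_one_sided(before, after)
print(k, n, p)  # 7 8 0.03515625
```

Because the statistic depends only on signs, the magnitudes of the differences (and hence per-pair scale heterogeneity) do not enter the p-value.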

arXiv:1704.03535 [pdf, other]

On the Pervasiveness of Difference-Convexity in Optimization and Statistics

Authors: Maher Nouiehed, Jong-Shi Pang, Meisam Razaviyayn

Abstract: With the increasing interest in applying the methodology of difference-of-convex (dc) optimization to diverse problems in engineering and statistics, this paper establishes the dc property of many well-known functions not previously known to be of this class. Motivated by a quadratic programming based recourse function in two-stage stochastic programming, we show that the (optimal) value function of a copositive (thus not necessarily convex) quadratic program is dc on the domain of finiteness of the program when the matrix in the objective function's quadratic term and the constraint matrix are fixed. The proof of this result is based on a dc decomposition of a piecewise LC1 function (i.e., a function with Lipschitz gradients). Armed with these new results and known properties of dc functions existing in the literature, we show that many composite statistical functions in risk analysis, including the value-at-risk (VaR), conditional value-at-risk (CVaR), expectation-based, VaR-based, and CVaR-based random deviation functions are all dc. Adding the known class of dc surrogate sparsity functions that are employed as approximations of the $\ell_0$ function in statistical learning, our work significantly expands the family of dc functions and positions them for fruitful applications.

Submitted 19 February, 2019; v1 submitted 11 April, 2017; originally announced April 2017.

arXiv:1607.03092 [pdf, ps, other]

doi 10.1109/TSP.2017.2731321

Inexact Block Coordinate Descent Methods For Symmetric Nonnegative Matrix Factorization

Authors: Qingjiang Shi, Haoran Sun, Songtao Lu, Mingyi Hong, Meisam Razaviyayn

Abstract: Symmetric nonnegative matrix factorization (SNMF) is equivalent to computing a symmetric nonnegative low rank approximation of a data similarity matrix. It inherits the good data interpretability of the well-known nonnegative matrix factorization technique and has a better ability to cluster nonlinearly separable data. In this paper, we focus on the algorithmic aspect of the SNMF problem and propose simple inexact block coordinate descent methods to address the problem, leading to both serial and parallel algorithms. The proposed algorithms have guaranteed stationary convergence and can efficiently handle large-scale and/or sparse SNMF problems. Extensive simulations verify the effectiveness of the proposed algorithms compared to recent state-of-the-art algorithms.

Submitted 11 July, 2016; originally announced July 2016.

Comments: Submitted to TSP

arXiv:1511.02746 [pdf, other]

A Unified Algorithmic Framework for Block-Structured Optimization Involving Big Data

Authors: Mingyi Hong, Meisam Razaviyayn, Zhi-Quan Luo, Jong-Shi Pang

Abstract: This article presents a powerful algorithmic framework for big data optimization, called the Block Successive Upper bound Minimization (BSUM). The BSUM includes as special cases many well-known methods for analyzing massive data sets, such as the Block Coordinate Descent (BCD), the Convex-Concave Procedure (CCCP), the Block Coordinate Proximal Gradient (BCPG) method, the Nonnegative Matrix Factorization (NMF), the Expectation Maximization (EM) method and so on. In this article, various features and properties of the BSUM are discussed from the viewpoint of design flexibility, computational efficiency, parallel/distributed implementation and the required communication overhead. Illustrative examples from networking, signal processing and machine learning are presented to demonstrate the practical performance of the BSUM framework.

Submitted 9 November, 2015; originally announced November 2015.

arXiv:1511.01796 [pdf, ps, other]

Computing B-Stationary Points of Nonsmooth DC Programs

Authors: Jong-Shi Pang, Meisam Razaviyayn, Alberth Alvarado

Abstract: Motivated by a class of applied problems arising from physical layer based security in a digital communication system, in particular, by a secrecy sum-rate maximization problem, this paper studies a nonsmooth, difference-of-convex (dc) minimization problem. The contributions of this paper are: (i) clarify several kinds of stationary solutions and their relations; (ii) develop and establish the convergence of a novel algorithm for computing a d-stationary solution of a problem with a convex feasible set that is arguably the sharpest kind among the various stationary solutions; (iii) extend the algorithm in several directions including: a randomized choice of the subproblems that could help the practical convergence of the algorithm, a distributed penalty approach for problems whose objective functions are sums of dc functions, and problems with a specially structured (nonconvex) dc constraint. For the latter class of problems, a pointwise Slater constraint qualification is introduced that facilitates the verification and computation of a B(ouligand)-stationary point.

Submitted 5 November, 2015; originally announced November 2015.

arXiv:1410.1390 [pdf, ps, other]

Convergence Analysis of Alternating Direction Method of Multipliers for a Family of Nonconvex Problems

Authors: Mingyi Hong, Zhi-Quan Luo, Meisam Razaviyayn

Abstract: The alternating direction method of multipliers (ADMM) is widely used to solve large-scale linearly constrained optimization problems, convex or nonconvex, in many engineering fields. However, there is a general lack of theoretical understanding of the algorithm when the objective function is nonconvex. In this paper we analyze the convergence of the ADMM for solving certain nonconvex consensus and sharing problems, and show that the classical ADMM converges to the set of stationary solutions, provided that the penalty parameter in the augmented Lagrangian is chosen to be sufficiently large. For the sharing problems, we show that the ADMM is convergent regardless of the number of variable blocks. Our analysis does not impose any assumptions on the iterates generated by the algorithm, and is broadly applicable to many ADMM variants involving proximal update rules and various flexible block selection rules.

Submitted 29 November, 2015; v1 submitted 6 October, 2014; originally announced October 2014.

Comments: Accepted by SIOPT

arXiv:1406.3665 [pdf, ps, other]

Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization

Authors: Meisam Razaviyayn, Mingyi Hong, Zhi-Quan Luo, Jong-Shi Pang

Abstract: Consider the problem of minimizing the sum of a smooth (possibly non-convex) and a convex (possibly nonsmooth) function involving a large number of variables. A popular approach to solve this problem is the block coordinate descent (BCD) method whereby at each iteration only one variable block is updated while the remaining variables are held fixed. With the recent advances in multi-core parallel processing technology, it is desirable to parallelize the BCD method by allowing multiple blocks to be updated simultaneously at each iteration of the algorithm. In this work, we propose an inexact parallel BCD approach where at each iteration, a subset of the variables is updated in parallel by minimizing convex approximations of the original objective function. We investigate the convergence of this parallel BCD method for both randomized and cyclic variable selection rules. We analyze the asymptotic and non-asymptotic convergence behavior of the algorithm for both convex and non-convex objective functions. The numerical experiments suggest that for a special case of the Lasso minimization problem, the cyclic block selection rule can outperform the randomized rule.

Submitted 31 October, 2014; v1 submitted 13 June, 2014; originally announced June 2014.

arXiv:1401.7079 [pdf, ps, other]

A Block Successive Upper Bound Minimization Method of Multipliers for Linearly Constrained Convex Optimization

Authors: Mingyi Hong, Tsung-Hui Chang, Xiangfeng Wang, Meisam Razaviyayn, Shiqian Ma, Zhi-Quan Luo

Abstract: Consider the problem of minimizing the sum of a smooth convex function and a separable nonsmooth convex function subject to linear coupling constraints. Problems of this form arise in many contemporary applications including signal processing, wireless networking and smart grid provisioning. Motivated by the huge size of these applications, we propose a new class of first order primal-dual algorithms called the block successive upper-bound minimization method of multipliers (BSUM-M) to solve this family of problems. The BSUM-M updates the primal variable blocks successively by minimizing locally tight upper-bounds of the augmented Lagrangian of the original problem, followed by a gradient type update for the dual variable in closed form. We show that under certain regularity conditions, and when the primal block variables are updated in either a deterministic or a random fashion, the BSUM-M converges to the set of optimal solutions. Moreover, in the absence of linear constraints, we show that the BSUM-M, which reduces to the block successive upper-bound minimization (BSUM) method, is capable of linear convergence without strong convexity.

Submitted 27 January, 2014; originally announced January 2014.

arXiv:1310.6957 [pdf, ps, other]

Iteration Complexity Analysis of Block Coordinate Descent Methods

Authors: Mingyi Hong, Xiangfeng Wang, Meisam Razaviyayn, Zhi-Quan Luo

Abstract: In this paper, we provide a unified iteration complexity analysis for a family of general block coordinate descent (BCD) methods, covering popular methods such as the block coordinate gradient descent (BCGD) and the block coordinate proximal gradient (BCPG), under various coordinate update rules. We unify these algorithms under the so-called Block Successive Upper-bound Minimization (BSUM) framework, and show that for a broad class of multi-block nonsmooth convex problems, all algorithms covered by the BSUM framework achieve a global sublinear iteration complexity of $O(1/r)$, where $r$ is the iteration index. Moreover, for the case of block coordinate minimization (BCM) where each block is minimized exactly, we establish the sublinear convergence rate of $O(1/r)$ without a per-block strong convexity assumption. Further, we show that when there are only two blocks of variables, a special BSUM algorithm with Gauss-Seidel rule can be accelerated to achieve an improved rate of $O(1/r^2)$.

Submitted 28 April, 2015; v1 submitted 25 October, 2013; originally announced October 2013.

arXiv:1307.4457 [pdf, ps, other]

A Stochastic Successive Minimization Method for Nonsmooth Nonconvex Optimization with Applications to Transceiver Design in Wireless Communication Networks

Authors: Meisam Razaviyayn, Maziar Sanjabi, Zhi-Quan Luo

Abstract: Consider the problem of minimizing the expected value of a cost function parameterized by a random variable. The classical sample average approximation (SAA) method for solving this problem requires minimization of an ensemble average of the objective at each step, which can be expensive. In this paper, we propose a stochastic successive upper-bound minimization method (SSUM) which minimizes an approximate ensemble average at each iteration. To ensure convergence and to facilitate computation, we require the approximate ensemble average to be a locally tight upper-bound of the expected cost function and be easily optimized. The main contributions of this work include the development and analysis of the SSUM method as well as its applications in linear transceiver design for wireless communication networks and online dictionary learning. Moreover, using the SSUM framework, we extend the classical stochastic (sub-)gradient (SG) method to the case of minimizing a nonsmooth nonconvex objective function and establish its convergence.

Submitted 22 July, 2013; v1 submitted 16 July, 2013; originally announced July 2013.

arXiv:1209.2385 [pdf, ps, other]

A Unified Convergence Analysis of Block Successive Minimization Methods for Nonsmooth Optimization

Authors: Meisam Razaviyayn, Mingyi Hong, Zhi-Quan Luo

Abstract: The block coordinate descent (BCD) method is widely used for minimizing a continuous function f of several block variables. At each iteration of this method, a single block of variables is optimized, while the remaining variables are held fixed. To ensure the convergence of the BCD method, the subproblem to be optimized in each iteration needs to be solved exactly to its unique optimal solution. Unfortunately, these requirements are often too restrictive for many practical scenarios. In this paper, we study an alternative inexact BCD approach which updates the variable blocks by successively minimizing a sequence of approximations of f which are either locally tight upper bounds of f or strictly convex local approximations of f. We focus on characterizing the convergence properties for a fairly wide class of such methods, especially for the cases where the objective functions are either non-differentiable or nonconvex. Our results unify and extend the existing convergence results for many classical algorithms such as the BCD method, the difference of convex functions (DC) method, the expectation maximization (EM) algorithm, as well as the alternating proximal minimization algorithm.

Submitted 11 September, 2012; originally announced September 2012.
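
The inexact block update described in the abstract above can be sketched on a Lasso problem, where each block minimizes a quadratic upper bound of the smooth term plus the l1 penalty (a block proximal-gradient step). This is a minimal illustration under assumptions made here, not the authors' code: the Lasso objective, the cyclic block order, and the names `bsum_lasso`, `soft_threshold`, `block_size`, and `sweeps` are all hypothetical choices for the example.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def lasso_objective(A, b, x, lam):
    r = A @ x - b
    return 0.5 * (r @ r) + lam * np.abs(x).sum()


def bsum_lasso(A, b, lam, block_size=2, sweeps=50):
    """Cyclic block updates, each minimizing a locally tight quadratic upper bound."""
    n = A.shape[1]
    x = np.zeros(n)
    blocks = [np.arange(s, min(s + block_size, n)) for s in range(0, n, block_size)]
    # Curvature of the upper bound for each block: squared spectral norm of its
    # columns. Assumes no block consists only of zero columns.
    curv = [np.linalg.norm(A[:, idx], 2) ** 2 for idx in blocks]
    history = [lasso_objective(A, b, x, lam)]
    for _ in range(sweeps):
        for idx, L in zip(blocks, curv):
            # Gradient of the smooth part with respect to the current block
            grad = A[:, idx].T @ (A @ x - b)
            # Minimizer of the quadratic upper bound plus the l1 term
            x[idx] = soft_threshold(x[idx] - grad / L, lam / L)
        history.append(lasso_objective(A, b, x, lam))
    return x, history


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((30, 8))
    b = rng.standard_normal(30)
    x, history = bsum_lasso(A, b, lam=0.1)
    print("objective: first", history[0], "last", history[-1])
```

Because each surrogate is an upper bound that is tight at the current iterate, the objective cannot increase from one block update to the next, which is the property the convergence analysis in this line of work builds on.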

arXiv:1104.0992 [pdf, ps, other]

doi 10.1109/TSP.2011.2173683

On the Degrees of Freedom Achievable Through Interference Alignment in a MIMO Interference Channel

Authors: Meisam Razaviyayn, Gennady Lyubeznik, Zhi-Quan Luo

Abstract: Consider a K-user flat fading MIMO interference channel where the k-th transmitter (or receiver) is equipped with M_k (respectively N_k) antennas. If a large number of statistically independent channel extensions are allowed either across time or frequency, the recent work [1] suggests that the total achievable degrees of freedom (DoF) can be maximized via interference alignment, resulting in a total DoF that grows linearly with K even if M_k and N_k are bounded. In this work we consider the case where no channel extension is allowed, and establish a general condition that must be satisfied by any degrees of freedom tuple (d_1, d_2, ..., d_K) achievable through linear interference alignment. For a symmetric system with M_k = M, N_k = N, d_k = d for all k, this condition implies that the total achievable DoF cannot grow linearly with K, and is in fact no more than K(M + N)/(K + 1). We also show that this bound is tight when the number of antennas at each transceiver is divisible by the number of data streams.

Submitted 2 September, 2011; v1 submitted 5 April, 2011; originally announced April 2011.
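
As a worked instance of the symmetric bound quoted in the abstract above, take the illustrative values K = 3 and M = N = 2 (chosen here for the example, not taken from the paper):

```latex
\[
  \sum_{k=1}^{K} d_k \;=\; K d \;\le\; \frac{K(M+N)}{K+1}
  \;=\; \frac{3\,(2+2)}{3+1} \;=\; 3 ,
  \qquad\text{so } d \le 1 .
\]
```

More generally, the bound K(M + N)/(K + 1) stays below M + N for every K, which is why the total DoF cannot grow linearly with K when the antenna counts are fixed.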

Showing 1–38 of 38 results for author: Razaviyayn, M