-
The Role of Level-Set Geometry on the Performance of PDHG for Conic Linear Optimization
Authors:
Zikai Xiong,
Robert M. Freund
Abstract:
We consider solving huge-scale instances of (convex) conic linear optimization problems, at the scale where matrix-factorization-free methods are attractive or necessary. The restarted primal-dual hybrid gradient method (rPDHG) -- with heuristic enhancements and GPU implementation -- has been very successful in solving huge-scale linear programming (LP) problems; however its application to more general conic convex optimization problems is not so well-studied. We analyze the theoretical and practical performance of rPDHG for general (convex) conic linear optimization, and LP as a special case thereof. We show a relationship between the geometry of the primal-dual (sub-)level sets $W_\varepsilon$ and the convergence rate of rPDHG. Specifically, we prove a bound on the convergence rate of rPDHG that improves when there is a primal-dual (sub-)level set $W_\varepsilon$ for which (i) $W_\varepsilon$ is close to the optimal solution set (in Hausdorff distance), and (ii) the ratio of the diameter to the "conic radius" of $W_\varepsilon$ is small. And in the special case of LP problems, the performance of rPDHG is bounded only by this ratio applied to the (sub-)level set corresponding to the best non-optimal extreme point. Depending on the problem instance, this ratio can take on extreme values and can result in poor performance of rPDHG both in theory and in practice. To address this issue, we show how central-path-based linear transformations -- including conic rescaling -- can markedly enhance the convergence rate of rPDHG. Furthermore, we present computational results that demonstrate how such rescalings can accelerate convergence to high-accuracy solutions, and lead to more efficient methods for huge-scale linear optimization problems.
Submitted 23 June, 2024; v1 submitted 3 June, 2024;
originally announced June 2024.
-
Computational Guarantees for Restarted PDHG for LP based on "Limiting Error Ratios" and LP Sharpness
Authors:
Zikai Xiong,
Robert Michael Freund
Abstract:
In recent years, there has been growing interest in solving linear optimization problems - or more simply "LP" - using first-order methods. The restarted primal-dual hybrid gradient method (PDHG) - together with some heuristic techniques - has emerged as a powerful tool for solving huge-scale LPs. However, the theoretical understanding of it and the validation of various heuristic implementation techniques are still very limited. Existing complexity analyses have relied on the Hoffman constant of the LP KKT system, which is known to be overly conservative, difficult to compute (and hence difficult to empirically validate), and fails to offer insight into instance-specific characteristics of the LP problems. These limitations have limited the capability to discern which characteristics of LP instances lead to easy versus difficult LP. With the goal of overcoming these limitations, in this paper we introduce and develop two purely geometry-based condition measures for LP instances: the "limiting error ratio" and the LP sharpness. We provide new computational guarantees for the restarted PDHG based on these two condition measures. For the limiting error ratio, we provide a computable upper bound and show its relationship with the data instance's proximity to infeasibility under perturbation. For the LP sharpness, we prove its equivalence to the stability of the LP optimal solution set under perturbation of the objective function. We validate our computational guarantees in terms of these condition measures via specially constructed instances. Conversely, our computational guarantees validate the practical efficacy of certain heuristic techniques (row preconditioners and step-size tuning) that improve computational performance in practice. Finally, we present computational experiments on LP relaxations from the MIPLIB dataset that demonstrate the promise of various implementation strategies.
Submitted 29 April, 2024; v1 submitted 22 December, 2023;
originally announced December 2023.
-
On the Relation Between LP Sharpness and Limiting Error Ratio and Complexity Implications for Restarted PDHG
Authors:
Zikai Xiong,
Robert M. Freund
Abstract:
There has been a recent surge in development of first-order methods (FOMs) for solving huge-scale linear programming (LP) problems. The attractiveness of FOMs for LP stems in part from the fact that they avoid costly matrix factorization computation. However, the efficiency of FOMs is significantly influenced - both in theory and in practice - by certain instance-specific LP condition measures. Xiong and Freund recently showed that the performance of the restarted primal-dual hybrid gradient method (PDHG) is predominantly determined by two specific condition measures: LP sharpness and Limiting Error Ratio. In this paper we examine the relationship between these two measures, particularly in the case when the optimal solution is unique (which is generic - at least in theory), and we present an upper bound on the Limiting Error Ratio involving the reciprocal of the LP sharpness. This shows that in LP instances where there is a dual nondegenerate optimal solution, the computational complexity of restarted PDHG can be characterized solely in terms of LP sharpness and the distance to optimal solutions, and simplifies the theoretical complexity upper bound of restarted PDHG for these instances.
Submitted 27 December, 2023; v1 submitted 21 December, 2023;
originally announced December 2023.
-
Nonlinear conjugate gradient methods: worst-case convergence rates via computer-assisted analyses
Authors:
Shuvomoy Das Gupta,
Robert M. Freund,
Xu Andy Sun,
Adrien Taylor
Abstract:
We propose a computer-assisted approach to the analysis of the worst-case convergence of nonlinear conjugate gradient methods (NCGMs). Those methods are known for their generally good empirical performances for large-scale optimization, while having relatively incomplete analyses. Using our computer-assisted approach, we establish novel complexity bounds for the Polak-Ribière-Polyak (PRP) and the Fletcher-Reeves (FR) NCGMs for smooth strongly convex minimization. In particular, we construct mathematical proofs that establish the first non-asymptotic convergence bound for FR (which is historically the first developed NCGM), and a much improved non-asymptotic convergence bound for PRP. Additionally, we provide simple adversarial examples on which these methods do not perform better than gradient descent with exact line search, leaving very little room for improvements on the same class of problems.
Submitted 18 April, 2024; v1 submitted 4 January, 2023;
originally announced January 2023.
-
Using Taylor-Approximated Gradients to Improve the Frank-Wolfe Method for Empirical Risk Minimization
Authors:
Zikai Xiong,
Robert M. Freund
Abstract:
The Frank-Wolfe method has become increasingly useful in statistical and machine learning applications, due to the structure-inducing properties of the iterates, and especially in settings where linear minimization over the feasible set is more computationally efficient than projection. In the setting of Empirical Risk Minimization -- one of the fundamental optimization problems in statistical and machine learning -- the computational effectiveness of Frank-Wolfe methods typically grows linearly in the number of data observations $n$. This is in stark contrast to the case for typical stochastic projection methods. In order to reduce this dependence on $n$, we look to second-order smoothness of typical smooth loss functions (least squares loss and logistic loss, for example) and we propose amending the Frank-Wolfe method with Taylor series-approximated gradients, including variants for both deterministic and stochastic settings. Compared with current state-of-the-art methods in the regime where the optimality tolerance $\varepsilon$ is sufficiently small, our methods are able to simultaneously reduce the dependence on large $n$ while obtaining optimal convergence rates of Frank-Wolfe methods, in both the convex and non-convex settings. We also propose a novel adaptive step-size approach for which we have computational guarantees. Last of all, we present computational experiments which show that our methods exhibit very significant speed-ups over existing methods on real-world datasets for both convex and non-convex binary classification problems.
Submitted 21 November, 2023; v1 submitted 29 August, 2022;
originally announced August 2022.
-
Analysis of the Frank-Wolfe Method for Convex Composite Optimization involving a Logarithmically-Homogeneous Barrier
Authors:
Renbo Zhao,
Robert M. Freund
Abstract:
We present and analyze a new generalized Frank-Wolfe method for the composite optimization problem $(P):{\min}_{x\in\mathbb{R}^n}\; f(\mathsf{A} x) + h(x)$, where $f$ is a $θ$-logarithmically-homogeneous self-concordant barrier, $\mathsf{A}$ is a linear operator and the function $h$ has bounded domain but is possibly non-smooth. We show that our generalized Frank-Wolfe method requires $O((δ_0 + θ+ R_h)\ln(δ_0) + (θ+ R_h)^2/\varepsilon)$ iterations to produce an $\varepsilon$-approximate solution, where $δ_0$ denotes the initial optimality gap and $R_h$ is the variation of $h$ on its domain. This result establishes certain intrinsic connections between $θ$-logarithmically homogeneous barriers and the Frank-Wolfe method. When specialized to the $D$-optimal design problem, we essentially recover the complexity obtained by Khachiyan using the Frank-Wolfe method with exact line-search. We also study the (Fenchel) dual problem of $(P)$, and we show that our new method is equivalent to an adaptive-step-size mirror descent method applied to the dual problem. This enables us to provide iteration complexity bounds for the mirror descent method even though the dual objective function is non-Lipschitz and has unbounded domain. In addition, we present computational experiments that point to the potential usefulness of our generalized Frank-Wolfe method on Poisson image de-blurring problems with TV regularization, and on simulated PET problem instances.
Submitted 5 December, 2021; v1 submitted 18 October, 2020;
originally announced October 2020.
-
Stochastic Frank-Wolfe for Constrained Finite-Sum Minimization
Authors:
Geoffrey Négiar,
Gideon Dresdner,
Alicia Tsai,
Laurent El Ghaoui,
Francesco Locatello,
Robert M. Freund,
Fabian Pedregosa
Abstract:
We propose a novel Stochastic Frank-Wolfe (a.k.a. conditional gradient) algorithm for constrained smooth finite-sum minimization with a generalized linear prediction/structure. This class of problems includes empirical risk minimization with sparse, low-rank, or other structured constraints. The proposed method is simple to implement, does not require step-size tuning, and has a constant per-iteration cost that is independent of the dataset size. Furthermore, as a byproduct of the method we obtain a stochastic estimator of the Frank-Wolfe gap that can be used as a stopping criterion. Depending on the setting, the proposed method matches or improves on the best computational guarantees for Stochastic Frank-Wolfe algorithms. Benchmarks on several datasets highlight different regimes in which the proposed method exhibits a faster empirical convergence than related methods. Finally, we provide an implementation of all considered methods in an open-source package.
Submitted 8 September, 2022; v1 submitted 26 February, 2020;
originally announced February 2020.
-
An Oblivious Ellipsoid Algorithm for Solving a System of (In)Feasible Linear Inequalities
Authors:
Jourdain Lamperski,
Robert M. Freund,
Michael J. Todd
Abstract:
The ellipsoid algorithm is a fundamental algorithm for computing a solution to the system of $m$ linear inequalities in $n$ variables $(P): A^{\top}x \le u$ when its set of solutions has positive volume. However, when $(P)$ is infeasible, the ellipsoid algorithm has no mechanism for proving that $(P)$ is infeasible. This is in contrast to the other two fundamental algorithms for tackling $(P)$, namely the simplex method and interior-point methods, each of which can be easily implemented in a way that either produces a solution of $(P)$ or proves that $(P)$ is infeasible by producing a solution to the alternative system $\mathrm{({\it Alt})}: Aλ= 0$, $u^{\top}λ< 0$, $λ\ge 0$. This paper develops an Oblivious Ellipsoid Algorithm (OEA) that either produces a solution of $(P)$ or produces a solution of $\mathrm{({\it Alt})}$. Depending on the dimensions and on other natural condition measures, the computational complexity of the basic OEA may be worse than, the same as, or better than that of the standard ellipsoid algorithm. We also present two modified versions of OEA, whose computational complexity is superior to that of OEA when $n \ll m$. This is achieved in the first modified version by proving infeasibility without actually producing a solution of $\mathrm{({\it Alt})}$, and in the second modified version by using more memory.
Submitted 28 December, 2020; v1 submitted 7 October, 2019;
originally announced October 2019.
-
Condition Number Analysis of Logistic Regression, and its Implications for Standard First-Order Solution Methods
Authors:
Robert M. Freund,
Paul Grigas,
Rahul Mazumder
Abstract:
Logistic regression is one of the most popular methods in binary classification, wherein estimation of model parameters is carried out by solving the maximum likelihood (ML) optimization problem, and the ML estimator is defined to be the optimal solution of this problem. It is well known that the ML estimator exists when the data is non-separable, but fails to exist when the data is separable. First-order methods are the algorithms of choice for solving large-scale instances of the logistic regression problem. In this paper, we introduce a pair of condition numbers that measure the degree of non-separability or separability of a given dataset in the setting of binary classification, and we study how these condition numbers relate to and inform the properties and the convergence guarantees of first-order methods. When the training data is non-separable, we show that the degree of non-separability naturally enters the analysis and informs the properties and convergence guarantees of two standard first-order methods: steepest descent (for any given norm) and stochastic gradient descent. Expanding on the work of Bach, we also show how the degree of non-separability enters into the analysis of linear convergence of steepest descent (without needing strong convexity), as well as the adaptive convergence of stochastic gradient descent. When the training data is separable, first-order methods rather curiously have good empirical success, which is not well understood in theory. In the case of separable data, we demonstrate how the degree of separability enters into the analysis of $\ell_2$ steepest descent and stochastic gradient descent for delivering approximate-maximum-margin solutions with associated computational guarantees as well. This suggests that first-order methods can lead to statistically meaningful solutions in the separable case, even though the ML solution does not exist.
Submitted 19 October, 2018;
originally announced October 2018.
-
Generalized Stochastic Frank-Wolfe Algorithm with Stochastic "Substitute" Gradient for Structured Convex Optimization
Authors:
Haihao Lu,
Robert M. Freund
Abstract:
The stochastic Frank-Wolfe method has recently attracted much general interest in the context of optimization for statistical and machine learning due to its ability to work with a more general feasible region. However, there has been a complexity gap in the guaranteed convergence rate for stochastic Frank-Wolfe compared to its deterministic counterpart. In this work, we present a new generalized stochastic Frank-Wolfe method which closes this gap for the class of structured optimization problems encountered in statistical and machine learning characterized by empirical loss minimization with a certain type of "linear prediction" property (formally defined in the paper), which is typically present in loss minimization problems in practice. Our method also introduces the notion of a "substitute gradient" that is a not-necessarily-unbiased sample of the gradient. We show that our new method is equivalent to a particular randomized coordinate mirror descent algorithm applied to the dual problem, which in turn provides a new interpretation of randomized dual coordinate descent in the primal space. Also, in the special case of a strongly convex regularizer our generalized stochastic Frank-Wolfe method (as well as the randomized dual coordinate descent method) exhibits linear convergence. Furthermore, we present computational experiments that indicate that our method outperforms other stochastic Frank-Wolfe methods, consistent with the theory developed herein.
Submitted 4 November, 2019; v1 submitted 19 July, 2018;
originally announced July 2018.
-
Accelerating Greedy Coordinate Descent Methods
Authors:
Haihao Lu,
Robert M. Freund,
Vahab Mirrokni
Abstract:
We study ways to accelerate greedy coordinate descent in theory and in practice, where "accelerate" refers either to $O(1/k^2)$ convergence in theory, in practice, or both. We introduce and study two algorithms: Accelerated Semi-Greedy Coordinate Descent (ASCD) and Accelerated Greedy Coordinate Descent (AGCD). While ASCD takes greedy steps in the $x$-updates and randomized steps in the $z$-updates, AGCD is a straightforward extension of standard greedy coordinate descent that only takes greedy steps. On the theory side, our main results are for ASCD: we show that ASCD achieves $O(1/k^2)$ convergence, and it also achieves accelerated linear convergence for strongly convex functions. On the empirical side, we observe that both AGCD and ASCD outperform Accelerated Randomized Coordinate Descent on a variety of instances. In particular, we note that AGCD significantly outperforms the other accelerated coordinate descent methods in numerical tests, in spite of a lack of theoretical guarantees for this method. To complement the empirical study of AGCD, we present a Lyapunov energy function argument that points to an explanation for why a direct extension of the acceleration proof for AGCD does not work; and we also introduce a technical condition under which AGCD is guaranteed to have accelerated convergence. Last of all, we confirm that this technical condition holds in our empirical study.
Submitted 6 June, 2018;
originally announced June 2018.
-
Relatively-Smooth Convex Optimization by First-Order Methods, and Applications
Authors:
Haihao Lu,
Robert M. Freund,
Yurii Nesterov
Abstract:
The usual approach to developing and analyzing first-order methods for smooth convex optimization assumes that the gradient of the objective function is uniformly smooth with some Lipschitz constant $L$. However, in many settings the differentiable convex function $f(\cdot)$ is not uniformly smooth -- for example in $D$-optimal design where $f(x):=-\ln \det(HXH^T)$, or even the univariate setting with $f(x) := -\ln(x) + x^2$. Herein we develop a notion of "relative smoothness" and relative strong convexity that is determined relative to a user-specified "reference function" $h(\cdot)$ (that should be computationally tractable for algorithms), and we show that many differentiable convex functions are relatively smooth with respect to a correspondingly fairly-simple reference function $h(\cdot)$. We extend two standard algorithms -- the primal gradient scheme and the dual averaging scheme -- to our new setting, with associated computational guarantees. We apply our new approach to develop a new first-order method for the $D$-optimal design problem, with associated computational complexity analysis. Some of our results have a certain overlap with the recent work \cite{bbt}.
Submitted 10 October, 2017; v1 submitted 18 October, 2016;
originally announced October 2016.
-
New Computational Guarantees for Solving Convex Optimization Problems with First Order Methods, via a Function Growth Condition Measure
Authors:
Robert M. Freund,
Haihao Lu
Abstract:
Motivated by recent work of Renegar, we present new computational methods and associated computational guarantees for solving convex optimization problems using first-order methods. Our problem of interest is the general convex optimization problem $f^* = \min_{x \in Q} f(x)$, where we presume knowledge of a strict lower bound $f_{\mathrm{slb}} < f^*$. [Indeed, $f_{\mathrm{slb}}$ is naturally known when optimizing many loss functions in statistics and machine learning (least-squares, logistic loss, exponential loss, total variation loss, etc.) as well as in Renegar's transformed version of the standard conic optimization problem; in all these cases one has $f_{\mathrm{slb}} = 0 < f^*$.] We introduce a new functional measure called the growth constant $G$ for $f(\cdot)$, that measures how quickly the level sets of $f(\cdot)$ grow relative to the function value, and that plays a fundamental role in the complexity analysis. When $f(\cdot)$ is non-smooth, we present new computational guarantees for the Subgradient Descent Method and for smoothing methods, that can improve existing computational guarantees in several ways, most notably when the initial iterate $x^0$ is far from the optimal solution set. When $f(\cdot)$ is smooth, we present a scheme for periodically restarting the Accelerated Gradient Method that can also improve existing computational guarantees when $x^0$ is far from the optimal solution set, and in the presence of added structure we present a scheme using parametrically increased smoothing that further improves the associated computational guarantees.
Submitted 8 November, 2016; v1 submitted 9 November, 2015;
originally announced November 2015.
-
An Extended Frank-Wolfe Method with "In-Face" Directions, and its Application to Low-Rank Matrix Completion
Authors:
Robert M. Freund,
Paul Grigas,
Rahul Mazumder
Abstract:
Motivated principally by the low-rank matrix completion problem, we present an extension of the Frank-Wolfe method that is designed to induce near-optimal solutions on low-dimensional faces of the feasible region. This is accomplished by a new approach to generating "in-face" directions at each iteration, as well as through new choice rules for selecting between in-face and "regular" Frank-Wolfe steps. Our framework for generating in-face directions generalizes the notion of away-steps introduced by Wolfe. In particular, the in-face directions always keep the next iterate within the minimal face containing the current iterate. We present computational guarantees for the new method that trade off efficiency in computing near-optimal solutions with upper bounds on the dimension of minimal faces of iterates. We apply the new method to the matrix completion problem, where low-dimensional faces correspond to low-rank matrices. We present computational results that demonstrate the effectiveness of our methodological approach at producing nearly-optimal solutions of very low rank. On both artificial and real datasets, we demonstrate significant speed-ups in computing very low-rank nearly-optimal solutions as compared to either the Frank-Wolfe method or its traditional away-step variant.
Submitted 6 November, 2015;
originally announced November 2015.
-
A New Perspective on Boosting in Linear Regression via Subgradient Optimization and Relatives
Authors:
Robert M. Freund,
Paul Grigas,
Rahul Mazumder
Abstract:
In this paper we analyze boosting algorithms in linear regression from a new perspective: that of modern first-order methods in convex optimization. We show that classic boosting algorithms in linear regression, namely the incremental forward stagewise algorithm (FS$_\varepsilon$) and least squares boosting (LS-Boost($\varepsilon$)), can be viewed as subgradient descent to minimize the loss function defined as the maximum absolute correlation between the features and residuals. We also propose a modification of FS$_\varepsilon$ that yields an algorithm for the Lasso, and that may be easily extended to an algorithm that computes the Lasso path for different values of the regularization parameter. Furthermore, we show that these new algorithms for the Lasso may also be interpreted as the same master algorithm (subgradient descent), applied to a regularized version of the maximum absolute correlation loss function. We derive novel, comprehensive computational guarantees for several boosting algorithms in linear regression (including LS-Boost($\varepsilon$) and FS$_\varepsilon$) by using techniques of modern first-order methods in convex optimization. Our computational guarantees inform us about the statistical properties of boosting algorithms. In particular they provide, for the first time, a precise theoretical description of the amount of data-fidelity and regularization imparted by running a boosting algorithm with a prespecified learning rate for a fixed but arbitrary number of iterations, for any dataset.
Submitted 16 May, 2015;
originally announced May 2015.
-
Robust topology optimization of three-dimensional photonic-crystal band-gap structures
Authors:
Han Men,
Karen Y. K. Lee,
Robert M. Freund,
Jaime Peraire,
Steven G. Johnson
Abstract:
We perform full 3D topology optimization (in which "every voxel" of the unit cell is a degree of freedom) of photonic-crystal structures in order to find optimal omnidirectional band gaps for various symmetry groups, including fcc (including diamond), bcc, and simple-cubic lattices. Even without imposing the constraints of any fabrication process, the resulting optimal gaps are only slightly larger than previous hand designs, suggesting that current photonic crystals are nearly optimal in this respect. However, optimization can discover new structures, e.g. a new fcc structure with the same symmetry but slightly larger gap than the well-known inverse opal, which may offer new degrees of freedom to future fabrication technologies. Furthermore, our band-gap optimization is an illustration of a computational approach to 3D dispersion engineering which is applicable to many other problems in optics, based on a novel semidefinite-program formulation for nonconvex eigenvalue optimization combined with other techniques such as a simple approach to impose symmetry constraints. We also demonstrate a technique for \emph{robust} topology optimization, in which some uncertainty is included in each voxel and we optimize the worst-case gap, and we show that the resulting band gaps have increased robustness to systematic fabrication errors.
Submitted 19 May, 2014; v1 submitted 16 May, 2014;
originally announced May 2014.
-
Fabrication-Adaptive Optimization, with an Application to Photonic Crystal Design
Authors:
Han Men,
Robert M. Freund,
Ngoc C. Nguyen,
Joel Saa-Seoane,
Jaime Peraire
Abstract:
It is often the case that the computed optimal solution of an optimization problem cannot be implemented directly, irrespective of data accuracy, due to either (i) technological limitations (such as physical tolerances of machines or processes), (ii) the deliberate simplification of a model to keep it tractable (by ignoring certain types of constraints that pose computational difficulties), and/or (iii) human factors (getting people to "do" the optimal solution). Motivated by this observation, we present a modeling paradigm called "fabrication-adaptive optimization" for treating issues of implementation/fabrication. We develop computationally-focused theory and algorithms, and we present computational results for incorporating considerations of implementation/fabrication into constrained optimization problems that arise in photonic crystal design. The fabrication-adaptive optimization framework stems from the robust regularization of a function. When the feasible region is not a normed space (as typically encountered in application settings), the fabrication-adaptive optimization framework typically yields a non-convex optimization problem. (In the special case where the feasible region is a finite-dimensional normed space, we show that fabrication-adaptive optimization can be re-cast as an instance of modern robust optimization.) We study a variety of problems with special structures on functions, feasible regions, and norms, for which computation is tractable, and develop an algorithmic scheme for solving these problems in spite of the challenges of non-convexity. We apply our methodology to compute fabrication-adaptive designs of two-dimensional photonic crystals with a variety of prescribed features.
Submitted 19 May, 2014; v1 submitted 21 July, 2013;
originally announced July 2013.
-
AdaBoost and Forward Stagewise Regression are First-Order Convex Optimization Methods
Authors:
Robert M. Freund,
Paul Grigas,
Rahul Mazumder
Abstract:
Boosting methods are highly popular and effective supervised learning methods which combine weak learners into a single accurate model with good statistical performance. In this paper, we analyze two well-known boosting methods, AdaBoost and Incremental Forward Stagewise Regression (FS$_\varepsilon$), by establishing their precise connections to the Mirror Descent algorithm, which is a first-order method in convex optimization. As a consequence of these connections we obtain novel computational guarantees for these boosting methods. In particular, we characterize convergence bounds of AdaBoost, related to both the margin and log-exponential loss function, for any step-size sequence. Furthermore, this paper presents, for the first time, precise computational complexity results for FS$_\varepsilon$.
Submitted 3 July, 2013;
originally announced July 2013.
-
New Analysis and Results for the Frank-Wolfe Method
Authors:
Robert M. Freund,
Paul Grigas
Abstract:
We present new results for the Frank-Wolfe method (also known as the conditional gradient method). We derive computational guarantees for arbitrary step-size sequences, which are then applied to various step-size rules, including simple averaging and constant step-sizes. We also develop step-size rules and computational guarantees that depend naturally on the warm-start quality of the initial (and subsequent) iterates. Our results include computational guarantees for both duality/bound gaps and the so-called FW gaps. Lastly, we present complexity bounds in the presence of approximate computation of gradients and/or linear optimization subproblem solutions.
Submitted 2 June, 2014; v1 submitted 2 July, 2013;
originally announced July 2013.
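As a concrete illustration of the method named in the entry above, here is a minimal sketch of the Frank-Wolfe (conditional gradient) iteration on the unit simplex. It uses the common step size $2/(k+2)$ and records the FW gap $\nabla f(x_k)^T (x_k - s_k)$ at each iteration. The function name `frank_wolfe_simplex`, the choice of the simplex as feasible region, and this particular step-size rule are assumptions made for illustration; the paper's own step-size rules and guarantees are not reproduced here.

```python
def frank_wolfe_simplex(grad, x0, num_iters=200):
    """Frank-Wolfe on the unit simplex with step size 2/(k+2), illustrative sketch.

    grad: function mapping a point (list of floats) to its gradient (list).
    x0:   starting point in the simplex (nonnegative entries summing to 1).
    Returns the final iterate and the list of FW gaps, one per iteration.
    """
    x = list(x0)
    gaps = []
    for k in range(num_iters):
        g = grad(x)
        # Linear optimization subproblem: over the simplex, the minimizer of
        # g^T s is the vertex e_i with the smallest gradient entry.
        i_star = min(range(len(x)), key=lambda i: g[i])
        # FW gap g^T (x - s); for convex f it upper-bounds f(x) - f*.
        gaps.append(sum(gi * xi for gi, xi in zip(g, x)) - g[i_star])
        step = 2.0 / (k + 2)
        # Convex combination of the current iterate and the chosen vertex.
        x = [(1.0 - step) * xi for xi in x]
        x[i_star] += step
    return x, gaps
```

For example, with `grad = lambda x: [xi - bi for xi, bi in zip(x, b)]` the routine minimizes $\tfrac{1}{2}\|x - b\|^2$ over the simplex, and each iterate stays feasible because it is a convex combination of feasible points.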
-
Band Gap Optimization of Two-Dimensional Photonic Crystals Using Semidefinite Programming and Subspace Methods
Authors:
Han Men,
Ngoc-Cuong Nguyen,
Robert M. Freund,
Pablo A. Parrilo,
Jaume Peraire
Abstract:
In this paper, we consider the optimal design of photonic crystal band structures for two-dimensional square lattices. The mathematical formulation of the band gap optimization problem leads to an infinite-dimensional Hermitian eigenvalue optimization problem parametrized by the dielectric material and the wave vector. To make the problem tractable, the original eigenvalue problem is discretized using the finite element method into a series of finite-dimensional eigenvalue problems for multiple values of the wave vector parameter. The resulting optimization problem is large-scale and non-convex, with low regularity and non-differentiable objective. By restricting to appropriate eigenspaces, we reduce the large-scale non-convex optimization problem via reparametrization to a sequence of small-scale convex semidefinite programs (SDPs) for which modern SDP solvers can be efficiently applied. Numerical results are presented for both transverse magnetic (TM) and transverse electric (TE) polarizations at several frequency bands. The optimized structures exhibit patterns which go far beyond typical physical intuition on periodic media design.
Submitted 13 July, 2009;
originally announced July 2009.