Fully Zeroth-Order Bilevel Programming via Gaussian Smoothing

Abstract: In this paper, we study and analyze zeroth-order stochastic approximation algorithms for solving bilevel problems, when neither the upper/lower objective values, nor their unbiased gradient estimates are available. In particular, exploiting Stein's identity, we first use Gaussian smoothing to estimate first- and second-order partial derivatives of functions with two independent blocks of variables. We then use these estimates in the framework of a stochastic approximation algorithm for solving bilevel optimization problems and establish its non-asymptotic convergence analysis. To the best of our knowledge, this is the first time that sample complexity bounds are established for a fully stochastic zeroth-order bilevel optimization algorithm.

Submitted 29 March, 2024; originally announced April 2024.
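
The Gaussian smoothing idea mentioned in the abstract above can be illustrated with a minimal sketch. This is not the paper's bilevel algorithm: it shows only the basic single-block, first-order estimator, which averages (f(x + νu) − f(x))/ν · u over standard normal directions u. The function name `gaussian_smoothing_grad`, the smoothing parameter `nu`, the sample count, and the quadratic test function are all assumptions chosen for illustration.

```python
import random


def gaussian_smoothing_grad(f, x, nu=1e-3, num_samples=1000, rng=None):
    """Estimate the gradient of f at x from function values only.

    Hypothetical helper: averages (f(x + nu*u) - f(x)) / nu * u over
    standard normal directions u (forward-difference form).
    """
    rng = rng or random.Random(0)
    d = len(x)
    fx = f(x)
    grad = [0.0] * d
    for _ in range(num_samples):
        # Draw a random direction u ~ N(0, I).
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        x_pert = [xi + nu * ui for xi, ui in zip(x, u)]
        scale = (f(x_pert) - fx) / nu
        for i in range(d):
            grad[i] += scale * u[i] / num_samples
    return grad


def quadratic(x):
    # Illustrative test function with known gradient 2*x.
    return sum(xi * xi for xi in x)


estimate = gaussian_smoothing_grad(
    quadratic, [1.0, -2.0], num_samples=20000, rng=random.Random(0)
)
print(estimate)  # should be close to the true gradient [2.0, -4.0]
```

The estimator is noisy, so many samples (or averaging across iterations) are needed in practice; the abstract's method additionally estimates second-order partial derivatives over two blocks of variables, which this sketch does not attempt.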

arXiv:2307.05384 [pdf, other]

Stochastic Nested Compositional Bi-level Optimization for Robust Feature Learning

Authors: Xuxing Chen, Krishnakumar Balasubramanian, Saeed Ghadimi

Abstract: We develop and analyze stochastic approximation algorithms for solving nested compositional bi-level optimization problems. These problems involve a nested composition of $T$ potentially non-convex smooth functions in the upper-level, and a smooth and strongly convex function in the lower-level. Our proposed algorithm does not rely on matrix inversions or mini-batches and can achieve an $ε$-stationary solution with an oracle complexity of approximately $\tilde{O}_T(1/ε^{2})$, assuming the availability of stochastic first-order oracles for the individual functions in the composition and the lower-level, which are unbiased and have bounded moments. Here, $\tilde{O}_T$ hides polylog factors and constants that depend on $T$. The key challenge we address in establishing this result relates to handling three distinct sources of bias in the stochastic gradients. The first source arises from the compositional nature of the upper-level, the second stems from the bi-level structure, and the third emerges due to the utilization of Neumann series approximations to avoid matrix inversion. To demonstrate the effectiveness of our approach, we apply it to the problem of robust feature learning for deep neural networks under covariate shift, showcasing the benefits and advantages of our methodology in that context.

Submitted 11 July, 2023; originally announced July 2023.
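
The abstract above mentions Neumann series approximations as a way to avoid matrix inversion. A minimal deterministic sketch of that idea follows, assuming a symmetric positive definite matrix A and a step size η with η·λ_max(A) < 1, so that A⁻¹b ≈ η Σ_{k=0}^{K−1} (I − ηA)^k b. The helper names, the 2×2 matrix, and the values of `eta` and `num_terms` are illustrative assumptions; the paper's stochastic estimator is not reproduced here.

```python
def mat_vec(A, v):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def neumann_inverse_vec(A, b, eta, num_terms):
    """Approximate A^{-1} b by eta * sum_{k=0}^{K-1} (I - eta*A)^k b.

    Hypothetical helper: only matrix-vector products are used,
    so no matrix is ever inverted.
    """
    term = list(b)            # holds (I - eta*A)^k b, starting at k = 0
    total = [0.0] * len(b)
    for _ in range(num_terms):
        total = [t + s for t, s in zip(total, term)]
        Av = mat_vec(A, term)
        term = [t - eta * a for t, a in zip(term, Av)]
    return [eta * t for t in total]


# Illustrative symmetric positive definite system; exact solution is [2/7, 6/7].
A = [[2.0, 0.5], [0.5, 1.0]]
b = [1.0, 1.0]
approx = neumann_inverse_vec(A, b, eta=0.4, num_terms=200)
print(approx)
```

Truncating the series after K terms leaves a bias that shrinks geometrically in K, which is one of the bias sources the abstract refers to.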

arXiv:2304.11220 [pdf, other]

Learn What NOT to Learn: Towards Generative Safety in Chatbots

Authors: Leila Khalatbari, Yejin Bang, Dan Su, Willy Chung, Saeed Ghadimi, Hossein Sameti, Pascale Fung

Abstract: Conversational models that are generative and open-domain are particularly susceptible to generating unsafe content since they are trained on web-based social data. Prior approaches to mitigating this issue have drawbacks, such as disrupting the flow of conversation, limited generalization to unseen toxic input contexts, and sacrificing the quality of the dialogue for the sake of safety. In this paper, we present a novel framework, named "LOT" (Learn NOT to), that employs a contrastive loss to enhance generalization by learning from both positive and negative training signals. Our approach differs from the standard contrastive learning framework in that it automatically obtains positive and negative signals from the safe and unsafe language distributions that have been learned beforehand. The LOT framework utilizes divergence to steer the generations away from the unsafe subspace and towards the safe subspace while sustaining the flow of conversation. Our approach is memory and time-efficient during decoding and effectively reduces toxicity while preserving engagingness and fluency. Empirical results indicate that LOT reduces toxicity by up to four-fold while achieving four to six-fold higher rates of engagingness and fluency compared to baseline models. Our findings are further corroborated by human evaluation.

Submitted 25 April, 2023; v1 submitted 21 April, 2023; originally announced April 2023.

Comments: 9 pages, 3 tables, 3 figures

arXiv:2302.09766 [pdf, other]

A One-Sample Decentralized Proximal Algorithm for Non-Convex Stochastic Composite Optimization

Authors: Tesi Xiao, Xuxing Chen, Krishnakumar Balasubramanian, Saeed Ghadimi

Abstract: We focus on decentralized stochastic non-convex optimization, where $n$ agents work together to optimize a composite objective function which is a sum of a smooth term and a non-smooth convex term. To solve this problem, we propose two single-time scale algorithms: Prox-DASA and Prox-DASA-GT. These algorithms can find $ε$-stationary points in $\mathcal{O}(n^{-1}ε^{-2})$ iterations using constant batch sizes (i.e., $\mathcal{O}(1)$). Unlike prior work, our algorithms achieve comparable complexity without requiring large batch sizes, more complex per-iteration operations (such as double loops), or stronger assumptions. Our theoretical findings are supported by extensive numerical experiments, which demonstrate the superiority of our algorithms over previous approaches. Our code is available at https://github.com/xuxingc/ProxDASA.

Submitted 22 June, 2023; v1 submitted 20 February, 2023; originally announced February 2023.

Comments: UAI 2023

arXiv:2206.11346 [pdf, other]

Constrained Stochastic Nonconvex Optimization with State-dependent Markov Data

Authors: Abhishek Roy, Krishnakumar Balasubramanian, Saeed Ghadimi

Abstract: We study stochastic optimization algorithms for constrained nonconvex stochastic optimization problems with Markovian data. In particular, we focus on the case when the transition kernel of the Markov chain is state-dependent. Such stochastic optimization problems arise in various machine learning problems including strategic classification and reinforcement learning. For this problem, we study both projection-based and projection-free algorithms. In both cases, we establish that the number of calls to the stochastic first-order oracle to obtain an appropriately defined $ε$-stationary point is of the order $\mathcal{O}(1/ε^{2.5})$. In the projection-free setting we additionally establish that the number of calls to the linear minimization oracle is of order $\mathcal{O}(1/ε^{5.5})$. We also empirically demonstrate the performance of our algorithm on the problem of strategic classification with neural networks.

Submitted 8 November, 2022; v1 submitted 22 June, 2022; originally announced June 2022.

Comments: 2 figures

arXiv:2205.13635 [pdf, other]

RIGID: Robust Linear Regression with Missing Data

Authors: Alireza Aghasi, MohammadJavad Feizollahi, Saeed Ghadimi

Abstract: We present a robust framework to perform linear regression with missing entries in the features. By considering an elliptical data distribution, and specifically a multivariate normal model, we are able to conditionally formulate a distribution for the missing entries and present a robust framework, which minimizes the worst case error caused by the uncertainty about the missing data. We show that the proposed formulation, which naturally takes into account the dependency between different variables, ultimately reduces to a convex program, for which a customized and scalable solver can be delivered. In addition to a detailed analysis to deliver such a solver, we also asymptotically analyze the behavior of the proposed framework, and present technical discussions to estimate the required input parameters. We complement our analysis with experiments performed on synthetic, semi-synthetic, and real data, and show how the proposed formulation improves the prediction accuracy and robustness, and outperforms the competing techniques. Missing data is a common problem associated with many datasets in machine learning. With the significant increase in using robust optimization techniques to train machine learning models, this paper presents a novel robust regression framework that operates by minimizing the uncertainty associated with missing data. The proposed approach allows training models with incomplete data, while minimizing the impact of uncertainty associated with the unavailable data. The ideas developed in this paper can be generalized beyond linear models and elliptical data distributions.

Submitted 8 November, 2022; v1 submitted 26 May, 2022; originally announced May 2022.

arXiv:2204.07317 [pdf, other]

Stochastic Search for a Parametric Cost Function Approximation: Energy storage with rolling forecasts

Authors: Saeed Ghadimi, Warren B. Powell

Abstract: Rolling forecasts have been almost overlooked in the renewable energy storage literature. In this paper, we provide a new approach for handling uncertainty not just in the accuracy of a forecast, but in the evolution of forecasts over time. Our approach shifts the focus from modeling the uncertainty in a lookahead model to accurate simulations in a stochastic base model. We develop a robust policy for making energy storage decisions by creating a parametrically modified lookahead model, where the parameters are tuned in the stochastic base model. Since computing unbiased stochastic gradients with respect to the parameters requires restrictive assumptions, we propose a simulation-based stochastic approximation algorithm based on numerical derivatives to optimize these parameters. While numerical derivatives, calculated based on the noisy function evaluations, provide biased gradient estimates, an online variance reduction technique built into the framework of our proposed algorithm enables us to control the accumulated bias errors and establish the finite-time rate of convergence of the algorithm. Our numerical experiments show the performance of this algorithm in finding policies outperforming the deterministic benchmark policy.

Submitted 14 April, 2022; originally announced April 2022.

arXiv:2202.04296 [pdf, ps, other]

A Projection-free Algorithm for Constrained Stochastic Multi-level Composition Optimization

Authors: Tesi Xiao, Krishnakumar Balasubramanian, Saeed Ghadimi

Abstract: We propose a projection-free conditional gradient-type algorithm for smooth stochastic multi-level composition optimization, where the objective function is a nested composition of $T$ functions and the constraint set is a closed convex set. Our algorithm assumes access to noisy evaluations of the functions and their gradients, through a stochastic first-order oracle satisfying certain standard unbiasedness and second moment assumptions. We show that the number of calls to the stochastic first-order oracle and the linear-minimization oracle required by the proposed algorithm, to obtain an $ε$-stationary solution, are of order $\mathcal{O}_T(ε^{-2})$ and $\mathcal{O}_T(ε^{-3})$ respectively, where $\mathcal{O}_T$ hides constants in $T$. Notably, the dependence of these complexity bounds on $ε$ and $T$ is separate in the sense that changing one does not impact the dependence of the bounds on the other. Moreover, our algorithm is parameter-free and does not require any (increasing) order of mini-batches to converge, unlike the common practice in the analysis of stochastic conditional gradient-type algorithms.

Submitted 9 October, 2022; v1 submitted 9 February, 2022; originally announced February 2022.

Comments: To appear in NeurIPS 2022
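
To make the "projection-free" and "linear-minimization oracle" terms in the abstract above concrete, here is a minimal sketch of the classical deterministic conditional gradient (Frank-Wolfe) method over the probability simplex. It is a swapped-in textbook baseline, not the paper's stochastic multi-level algorithm. The function name, the step size 2/(k+2), the quadratic objective, and the `target` vector are assumptions for illustration.

```python
def frank_wolfe_simplex(grad, dim, num_iters):
    """Conditional gradient over the probability simplex.

    Hypothetical helper: each step calls a linear minimization oracle
    (LMO) instead of a projection.
    """
    x = [1.0 / dim] * dim  # start at the simplex barycenter
    for k in range(num_iters):
        g = grad(x)
        # LMO on the simplex: minimizing <g, s> selects the vertex
        # at the coordinate with the smallest gradient entry.
        i_min = min(range(dim), key=lambda i: g[i])
        gamma = 2.0 / (k + 2.0)  # classical open-loop step size
        # Convex combination of x and the vertex keeps x feasible.
        x = [(1.0 - gamma) * xi for xi in x]
        x[i_min] += gamma
    return x


# Illustrative objective: squared distance to a point inside the simplex.
target = [0.2, 0.5, 0.3]


def grad_sq_dist(x):
    return [2.0 * (xi - ci) for xi, ci in zip(x, target)]


x_fw = frank_wolfe_simplex(grad_sq_dist, dim=3, num_iters=2000)
objective = sum((xi - ci) ** 2 for xi, ci in zip(x_fw, target))
print(x_fw, objective)
```

Because every iterate is a convex combination of feasible points, feasibility is maintained without ever projecting, which is the property the projection-free algorithms in this listing rely on.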

arXiv:2201.00258 [pdf, other]

The Parametric Cost Function Approximation: A new approach for multistage stochastic programming

Authors: Warren B Powell, Saeed Ghadimi

Abstract: The most common approaches for solving multistage stochastic programming problems in the research literature have been to either use value functions ("dynamic programming") or scenario trees ("stochastic programming") to approximate the impact of a decision now on the future. By contrast, common industry practice is to use a deterministic approximation of the future which is easier to understand and solve, but which is criticized for ignoring uncertainty. We show that a parameterized version of a deterministic optimization model can be an effective way of handling uncertainty without the complexity of either stochastic programming or dynamic programming. We present the idea of a parameterized deterministic optimization model, and in particular a deterministic lookahead model, as a powerful strategy for many complex stochastic decision problems. This approach can handle complex, high-dimensional state variables, and avoids the usual approximations associated with scenario trees or value function approximations. Instead, it introduces the offline challenge of designing and tuning the parameterization. We illustrate the idea by using a series of application settings, and demonstrate its use in a nonstationary energy storage problem with rolling forecasts.

Submitted 1 January, 2022; originally announced January 2022.

Comments: 3 figures

MSC Class: 68 ACM Class: F.2; I.2

arXiv:2012.04000 [pdf, other]

Deep Networks to Automatically Detect Late-activating Regions of the Heart

Authors: Jiarui Xing, Sona Ghadimi, Mohammad Abdishektaei, Kenneth C. Bilchick, Frederick H. Epstein, Miaomiao Zhang

Abstract: This paper presents a novel method to automatically identify late-activating regions of the left ventricle from cine Displacement Encoding with Stimulated Echo (DENSE) MR images. We develop a deep learning framework that identifies late mechanical activation in heart failure patients by detecting the Time to the Onset of circumferential Shortening (TOS). In particular, we build a cascade network performing end-to-end (i) segmentation of the left ventricle to analyze cardiac function, (ii) prediction of TOS based on spatiotemporal circumferential strains computed from displacement maps, and (iii) 3D visualization of delayed activation maps. Our approach results in dramatic savings of manual labor and computational time over traditional optimization-based algorithms. To evaluate the effectiveness of our method, we run tests on cardiac images and compare with recent related works. Experimental results show that the proposed approach provides fast prediction of TOS with improved accuracy.

Submitted 7 December, 2020; originally announced December 2020.

arXiv:2009.13016 [pdf, ps, other]

Escaping Saddle-Points Faster under Interpolation-like Conditions

Authors: Abhishek Roy, Krishnakumar Balasubramanian, Saeed Ghadimi, Prasant Mohapatra

Abstract: In this paper, we show that under over-parametrization several standard stochastic optimization algorithms escape saddle-points and converge to local-minimizers much faster. One of the fundamental aspects of over-parametrized models is that they are capable of interpolating the training data. We show that, under interpolation-like assumptions satisfied by the stochastic gradients in an over-parametrization setting, the first-order oracle complexity of the Perturbed Stochastic Gradient Descent (PSGD) algorithm to reach an $ε$-local-minimizer matches the corresponding deterministic rate of $\tilde{\mathcal{O}}(1/ε^{2})$. We next analyze the Stochastic Cubic-Regularized Newton (SCRN) algorithm under interpolation-like conditions, and show that the oracle complexity to reach an $ε$-local-minimizer is $\tilde{\mathcal{O}}(1/ε^{2.5})$. While this obtained complexity is better than the corresponding complexity of either PSGD, or SCRN without interpolation-like assumptions, it does not match the rate of $\tilde{\mathcal{O}}(1/ε^{1.5})$ corresponding to the deterministic Cubic-Regularized Newton method. It seems further Hessian-based interpolation-like assumptions are necessary to bridge this gap. We also discuss the corresponding improved complexities in the zeroth-order settings.

Submitted 27 September, 2020; originally announced September 2020.

Comments: To appear in NeurIPS, 2020

arXiv:2008.10526 [pdf, other]

Stochastic Multi-level Composition Optimization Algorithms with Level-Independent Convergence Rates

Authors: Krishnakumar Balasubramanian, Saeed Ghadimi, Anthony Nguyen

Abstract: In this paper, we study smooth stochastic multi-level composition optimization problems, where the objective function is a nested composition of $T$ functions. We assume access to noisy evaluations of the functions and their gradients, through a stochastic first-order oracle. For solving this class of problems, we propose two algorithms using moving-average stochastic estimates, and analyze their convergence to an $ε$-stationary point of the problem. We show that the first algorithm, which is a generalization of \cite{GhaRuswan20} to the $T$ level case, can achieve a sample complexity of $\mathcal{O}(1/ε^6)$ by using mini-batches of samples in each iteration. By modifying this algorithm using linearized stochastic estimates of the function values, we improve the sample complexity to $\mathcal{O}(1/ε^4)$. This modification not only removes the requirement of having a mini-batch of samples in each iteration, but also makes the algorithm parameter-free and easy to implement. To the best of our knowledge, this is the first time that such an online algorithm designed for the (un)constrained multi-level setting, obtains the same sample complexity of the smooth single-level setting, under standard assumptions (unbiasedness and boundedness of the second moments) on the stochastic first-order oracle.

Submitted 14 February, 2022; v1 submitted 24 August, 2020; originally announced August 2020.

Comments: Fixed some typos

arXiv:2006.08167 [pdf, other]

Improved Complexities for Stochastic Conditional Gradient Methods under Interpolation-like Conditions

Authors: Tesi Xiao, Krishnakumar Balasubramanian, Saeed Ghadimi

Abstract: We analyze stochastic conditional gradient methods for constrained optimization problems arising in over-parametrized machine learning. We show that one could leverage the interpolation-like conditions satisfied by such models to obtain improved oracle complexities. Specifically, when the objective function is convex, we show that the conditional gradient method requires $\mathcal{O}(ε^{-2})$ calls to the stochastic gradient oracle to find an $ε$-optimal solution. Furthermore, by including a gradient sliding step, we show that the number of calls reduces to $\mathcal{O}(ε^{-1.5})$.

Submitted 26 January, 2022; v1 submitted 15 June, 2020; originally announced June 2020.

arXiv:2001.00831 [pdf, other]

Reinforcement Learning via Parametric Cost Function Approximation for Multistage Stochastic Programming

Authors: Saeed Ghadimi, Raymond T. Perkins, Warren B. Powell

Abstract: The most common approaches for solving stochastic resource allocation problems in the research literature are to either use value functions ("dynamic programming") or scenario trees ("stochastic programming") to approximate the impact of a decision now on the future. By contrast, common industry practice is to use a deterministic approximation of the future which is easier to understand and solve, but which is criticized for ignoring uncertainty. We show that a parameterized version of a deterministic lookahead can be an effective way of handling uncertainty, while enjoying the computational simplicity of a deterministic lookahead. We present the parameterized lookahead model as a form of policy for solving a stochastic base model, which is used as the basis for optimizing the parameterized policy. This approach can handle complex, high-dimensional state variables, and avoids the usual approximations associated with scenario trees. We formalize this approach and demonstrate its use in the context of a complex, nonstationary energy storage problem.

Submitted 2 January, 2020; originally announced January 2020.

Comments: arXiv admin note: text overlap with arXiv:1703.04644

arXiv:1910.03591 [pdf, other]

Robust and efficient algorithms for high-dimensional black-box quantum optimization

Authors: Zhaoqi Leng, Pranav Mundada, Saeed Ghadimi, Andrew Houck

Abstract: Hybrid quantum-classical optimization using near-term quantum technology is an emerging direction for exploring quantum advantage in high-dimensional systems. However, precise characterization of all experimental parameters is often impractical and challenging. A viable approach is to use algorithms that rely only on black-box inference rather than analytical gradients. Here, we combine randomized perturbation gradient estimation with adaptive momentum gradient updates to create the AdamSPSA and AdamRSGF algorithms. We prove the asymptotic convergence of our algorithms in a convex setting, and we benchmark them against other gradient-based optimization algorithms on non-convex optimal control tasks. Our results show that these new algorithms accelerate the convergence rate, decrease the variance of loss trajectories, and efficiently tune up high-fidelity (above 99.9%) Hann-window single-qubit gates from trivial initial conditions with twenty variables.

Submitted 10 October, 2019; v1 submitted 8 October, 2019; originally announced October 2019.

arXiv:1907.13616 [pdf, ps, other]

Multi-Point Bandit Algorithms for Nonstationary Online Nonconvex Optimization

Authors: Abhishek Roy, Krishnakumar Balasubramanian, Saeed Ghadimi, Prasant Mohapatra

Abstract: Bandit algorithms have been predominantly analyzed in the convex setting with function-value based stationary regret as the performance measure. In this paper, motivated by online reinforcement learning problems, we propose and analyze bandit algorithms for both general and structured nonconvex problems with nonstationary (or dynamic) regret as the performance measure, in both stochastic and non-stochastic settings. First, for general nonconvex functions, we consider nonstationary versions of first-order and second-order stationary solutions as a regret measure, motivated by similar performance measures for offline nonconvex optimization. In the case of second-order stationary solution based regret, we propose and analyze online and bandit versions of the cubic regularized Newton's method. The bandit version is based on estimating the Hessian matrices in the bandit setting, based on second-order Gaussian Stein's identity. Our nonstationary regret bounds in terms of second-order stationary solutions have interesting consequences for avoiding saddle points in the bandit setting. Next, for weakly quasi convex functions and monotone weakly submodular functions we consider nonstationary regret measures in terms of function-values; such structured classes of nonconvex functions enable one to consider regret measures defined in terms of function values, similar to convex functions. For this case of function-value, and first-order stationary solution based regret measures, we provide regret bounds in both the low- and high-dimensional settings, for some scenarios.

Submitted 11 September, 2019; v1 submitted 31 July, 2019; originally announced July 2019.

arXiv:1902.01373 [pdf, ps, other]

Stochastic Zeroth-order Discretizations of Langevin Diffusions for Bayesian Inference

Authors: Abhishek Roy, Lingqing Shen, Krishnakumar Balasubramanian, Saeed Ghadimi

Abstract: Discretizations of Langevin diffusions provide a powerful method for sampling and Bayesian inference. However, such discretizations require evaluation of the gradient of the potential function. In several real-world scenarios, obtaining gradient evaluations might either be computationally expensive, or simply impossible. In this work, we propose and analyze stochastic zeroth-order sampling algorithms for discretizing overdamped and underdamped Langevin diffusions. Our approach is based on estimating the gradients, based on Gaussian Stein's identities, widely used in the stochastic optimization literature. We provide a comprehensive sample complexity analysis (the number of noisy function evaluations to be made to obtain an $ε$-approximate sample in Wasserstein distance) of stochastic zeroth-order discretizations of both overdamped and underdamped Langevin diffusions, under various noise models. We also propose a variable selection technique based on zeroth-order gradient estimates and establish its theoretical guarantees. Our theoretical contributions extend the practical applicability of sampling algorithms to the noisy black-box and high-dimensional settings.

Submitted 17 January, 2021; v1 submitted 4 February, 2019; originally announced February 2019.

arXiv:1812.01094 [pdf, ps, other]

A Single Time-Scale Stochastic Approximation Method for Nested Stochastic Optimization

Authors: Saeed Ghadimi, Andrzej Ruszczyński, Mengdi Wang

Abstract: We study constrained nested stochastic optimization problems in which the objective function is a composition of two smooth functions whose exact values and derivatives are not available. We propose a single time-scale stochastic approximation algorithm, which we call the Nested Averaged Stochastic Approximation (NASA), to find an approximate stationary point of the problem. The algorithm has two auxiliary averaged sequences (filters) which estimate the gradient of the composite objective function and the inner function value. By using a special Lyapunov function, we show that NASA achieves the sample complexity of ${\cal O}(1/ε^{2})$ for finding an $ε$-approximate stationary point, thus outperforming all extant methods for nested stochastic approximation. Our method and its analysis are the same for both unconstrained and constrained problems, without any need of batch samples for constrained nonconvex stochastic optimization. We also present a simplified variant of the NASA method for solving constrained single level stochastic optimization problems, and we prove the same complexity result for both unconstrained and constrained problems.

Submitted 6 September, 2019; v1 submitted 3 December, 2018; originally announced December 2018.

arXiv:1809.06474 [pdf, ps, other]

Zeroth-order Nonconvex Stochastic Optimization: Handling Constraints, High-Dimensionality and Saddle-Points

Authors: Krishnakumar Balasubramanian, Saeed Ghadimi

Abstract: In this paper, we propose and analyze zeroth-order stochastic approximation algorithms for nonconvex and convex optimization, with a focus on addressing constrained optimization, the high-dimensional setting, and saddle-point avoidance. To handle constrained optimization, we first propose generalizations of the conditional gradient algorithm achieving rates similar to the standard stochastic gradient algorithm using only zeroth-order information. To facilitate zeroth-order optimization in high dimensions, we explore the advantages of structural sparsity assumptions. Specifically, (i) we highlight an implicit regularization phenomenon where the standard stochastic gradient algorithm with zeroth-order information adapts to the sparsity of the problem at hand by just varying the step-size and (ii) we propose a truncated stochastic gradient algorithm with zeroth-order information, whose rate of convergence depends only poly-logarithmically on the dimensionality. We next focus on avoiding saddle-points in the non-convex setting. Towards that, we interpret the Gaussian smoothing technique for estimating the gradient based on zeroth-order information as an instantiation of first-order Stein's identity. Based on this, we provide a novel linear-(in dimension) time estimator of the Hessian matrix of a function using only zeroth-order information, which is based on second-order Stein's identity. We then provide an algorithm for avoiding saddle-points, which is based on a zeroth-order cubic regularization Newton's method, and discuss its convergence rates.

Submitted 13 January, 2019; v1 submitted 17 September, 2018; originally announced September 2018.

arXiv:1802.02246 [pdf, ps, other]

Approximation Methods for Bilevel Programming

Authors: Saeed Ghadimi, Mengdi Wang

Abstract: In this paper, we study a class of bilevel programming problems where the inner objective function is strongly convex. More specifically, under some mild assumptions on the partial derivatives of both inner and outer objective functions, we present an approximation algorithm for solving this class of problems and provide its finite-time convergence analysis under different convexity assumptions on the outer objective function. We also present an accelerated variant of this method which improves the rate of convergence under the convexity assumption. Furthermore, we generalize our results to the stochastic setting where only noisy information of both objective functions is available. To the best of our knowledge, this is the first time that such (stochastic) approximation algorithms with established iteration complexity (sample complexity) are provided for bilevel programming.

Submitted 6 February, 2018; originally announced February 2018.

arXiv:1710.05782 [pdf, other]

Second-Order Methods with Cubic Regularization Under Inexact Information

Authors: Saeed Ghadimi, Han Liu, Tong Zhang

Abstract: In this paper, we generalize (accelerated) Newton's method with cubic regularization under inexact second-order information for (strongly) convex optimization problems. Under mild assumptions, we provide global rate of convergence of these methods and show the explicit dependence of the rate of convergence on the problem parameters. While the complexity bounds of our presented algorithms are theoretically worse than those of their exact counterparts, they are at least as good as those of the optimal first-order methods. Our numerical experiments also show that using inexact Hessians can significantly speed up the algorithms in practice.

Submitted 16 October, 2017; originally announced October 2017.

arXiv:1602.00961 [pdf, ps, other]

Conditional gradient type methods for composite nonlinear and stochastic optimization

Authors: Saeed Ghadimi

Abstract: In this paper, we present a conditional gradient type (CGT) method for solving a class of composite optimization problems where the objective function consists of a (weakly) smooth term and a (strongly) convex regularization term. While including a strongly convex term in the subproblems of the classical conditional gradient (CG) method improves its rate of convergence, it does not cost per iteration as much as general proximal type algorithms. More specifically, we present a unified analysis for the CGT method in the sense that it achieves the best-known rate of convergence when the weakly smooth term is nonconvex and possesses (nearly) optimal complexity if it turns out to be convex. While implementation of the CGT method requires explicitly estimating problem parameters like the level of smoothness of the first term in the objective function, we also present a few variants of this method which relax such estimation. Unlike general proximal type parameter free methods, these variants of the CGT method do not require any additional effort for computing (sub)gradients of the objective function and/or solving extra subproblems at each iteration. We then generalize these methods under stochastic setting and present a few new complexity results. To the best of our knowledge, this is the first time that such complexity results are presented for solving stochastic weakly smooth nonconvex and (strongly) convex optimization problems.

Submitted 1 January, 2018; v1 submitted 2 February, 2016; originally announced February 2016.

arXiv:1508.07384 [pdf, ps, other]

Generalized Uniformly Optimal Methods for Nonlinear Programming

Authors: Saeed Ghadimi, Guanghui Lan, Hongchao Zhang

Abstract: In this paper, we present a generic framework to extend existing uniformly optimal convex programming algorithms to solve more general nonlinear, possibly nonconvex, optimization problems. The basic idea is to incorporate a local search step (gradient descent or Quasi-Newton iteration) into these uniformly optimal convex programming methods, and then enforce a monotone decreasing property of the function values computed along the trajectory. Algorithms of these types will then achieve the best known complexity for nonconvex problems, and the optimal complexity for convex ones without requiring any problem parameters. As a consequence, we can have a unified treatment for a general class of nonlinear programming problems regardless of their convexity and smoothness level. In particular, we show that the accelerated gradient and level methods, both originally designed for solving convex optimization problems only, can be used for solving both convex and nonconvex problems uniformly. In a similar vein, we show that some well-studied techniques for nonlinear programming, e.g., Quasi-Newton iteration, can be embedded into optimal convex optimization algorithms to possibly further enhance their numerical performance. Our theoretical and algorithmic developments are complemented by some promising numerical results obtained for solving a few important nonconvex and nonlinear data analysis problems in the literature.

Submitted 12 September, 2015; v1 submitted 28 August, 2015; originally announced August 2015.

arXiv:1310.3787 [pdf, ps, other]

Accelerated Gradient Methods for Nonconvex Nonlinear and Stochastic Programming

Authors: Saeed Ghadimi, Guanghui Lan

Abstract: In this paper, we generalize the well-known Nesterov's accelerated gradient (AG) method, originally designed for convex smooth optimization, to solve nonconvex and possibly stochastic optimization problems. We demonstrate that by properly specifying the stepsize policy, the AG method exhibits the best known rate of convergence for solving general nonconvex smooth optimization problems by using first-order information, similarly to the gradient descent method. We then consider an important class of composite optimization problems and show that the AG method can solve them uniformly, i.e., by using the same aggressive stepsize policy as in the convex case, even if the problem turns out to be nonconvex. We demonstrate that the AG method exhibits an optimal rate of convergence if the composite problem is convex, and improves the best known rate of convergence if the problem is nonconvex. Based on the AG method, we also present new nonconvex stochastic approximation methods and show that they can improve a few existing rates of convergence for nonconvex stochastic optimization. To the best of our knowledge, this is the first time that the convergence of the AG method has been established for solving nonconvex nonlinear programming in the literature.

Submitted 14 October, 2013; originally announced October 2013.

arXiv:1309.5549 [pdf, ps, other]

Stochastic First- and Zeroth-order Methods for Nonconvex Stochastic Programming

Authors: Saeed Ghadimi, Guanghui Lan

Abstract: In this paper, we introduce a new stochastic approximation (SA) type algorithm, namely the randomized stochastic gradient (RSG) method, for solving an important class of nonlinear (possibly nonconvex) stochastic programming (SP) problems. We establish the complexity of this method for computing an approximate stationary point of a nonlinear programming problem. We also show that this method possesses a nearly optimal rate of convergence if the problem is convex. We discuss a variant of the algorithm which consists of applying a post-optimization phase to evaluate a short list of solutions generated by several independent runs of the RSG method, and show that such a modification significantly improves the large-deviation properties of the algorithm. These methods are then specialized for solving a class of simulation-based optimization problems in which only stochastic zeroth-order information is available.

Submitted 21 September, 2013; originally announced September 2013.

arXiv:1308.6594 [pdf, ps, other]

Mini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization

Authors: Saeed Ghadimi, Guanghui Lan, Hongchao Zhang

Abstract: This paper considers a class of constrained stochastic composite optimization problems whose objective function is given by the summation of a differentiable (possibly nonconvex) component, together with a certain non-differentiable (but convex) component. In order to solve these problems, we propose a randomized stochastic projected gradient (RSPG) algorithm, in which a proper mini-batch of samples is taken at each iteration depending on the total budget of stochastic samples allowed. The RSPG algorithm also employs a general distance function to allow taking advantage of the geometry of the feasible region. Complexity of this algorithm is established in a unified setting, which shows nearly optimal complexity of the algorithm for convex stochastic programming. A post-optimization phase is also proposed to significantly reduce the variance of the solutions returned by the algorithm. In addition, based on the RSPG algorithm, a stochastic gradient free algorithm, which only uses the stochastic zeroth-order information, is also discussed. Some preliminary numerical results are also provided.

Submitted 5 September, 2013; v1 submitted 29 August, 2013; originally announced August 2013.

Comments: 32 pages

Showing 1–26 of 26 results for author: Ghadimi, S