Search | arXiv e-print repository

arXiv:2003.03391 [pdf, other]

doi 10.1016/j.ijengsci.2020.103278

Scattering under Linear Non Self-Adjoint Operators: Case of in-Plane Elastic Waves

Authors: Amir Ashkan Mokhtari, Yan Lu, Qiyuan Zhou, Alireza V. Amirkhizi, Ankit Srivastava

Abstract: In this paper, we consider the problem of the scattering of in-plane waves at an interface between a homogeneous medium and a metamaterial. The relevant eigenmodes in the two regions are calculated by solving a recently described non self-adjoint eigenvalue problem particularly suited to scattering studies. The method efficiently produces all propagating and evanescent modes consistent with the application of Snell's law and is applicable to very general scattering problems. In a model composite, we elucidate the emergence of a rich spectrum of eigenvalue degeneracies. These degeneracies appear in both the complex and real domains of the wave-vector. However, since this problem is non self-adjoint, these degeneracies generally represent a coalescing of both the eigenvalues and eigenvectors (exceptional points). Through explicit calculations of Poynting vector, we point out an intriguing phenomenon: there always appears to be an abrupt change in the sign of the refraction angle of the wave on two sides of an exceptional point. Furthermore, the presence of these degeneracies, in some cases, hints at fast changes in the scattered field as the incident angle is changed by small amounts. We calculate these scattered fields through a novel application of the Betti-Rayleigh reciprocity theorem. We present several numerical examples showing a rich scattering spectrum. In one particularly intriguing example, we point out wave behavior which may be related to the phenomenon of resonance trapping.
We also show that there exists a deep connection between energy flux conservation and the biorthogonality relationship of the non self-adjoint problem. The proof applies to the general class of scattering problems involving elastic waves (under self-adjoint or non self-adjoint operators).

Submitted 6 March, 2020; originally announced March 2020.

arXiv:2002.09964 [pdf, other]

Quantized Decentralized Stochastic Learning over Directed Graphs

Authors: Hossein Taheri, Aryan Mokhtari, Hamed Hassani, Ramtin Pedarsani

Abstract: We consider a decentralized stochastic learning problem where data points are distributed among computing nodes communicating over a directed graph. As the model size gets large, decentralized learning faces a major bottleneck that is the heavy communication load due to each node transmitting large messages (model updates) to its neighbors. To tackle this bottleneck, we propose the quantized decentralized stochastic learning algorithm over directed graphs that is based on the push-sum algorithm in decentralized consensus optimization. More importantly, we prove that our algorithm achieves the same convergence rates as the decentralized stochastic learning algorithm with exact communication for both convex and non-convex losses. Numerical evaluations corroborate our main theoretical results and illustrate significant speed-up compared to the exact-communication methods.

Submitted 28 December, 2020; v1 submitted 23 February, 2020; originally announced February 2020.

arXiv:2002.07948 [pdf, other]

Personalized Federated Learning: A Meta-Learning Approach

Authors: Alireza Fallah, Aryan Mokhtari, Asuman Ozdaglar

Abstract: In Federated Learning, we aim to train models across multiple computing units (users), while users can only communicate with a common central server, without exchanging their data samples. This mechanism exploits the computational power of all users and allows users to obtain a richer model as their models are trained over a larger set of data points. However, this scheme only develops a common output for all the users, and, therefore, it does not adapt the model to each user. This is an important missing feature, especially given the heterogeneity of the underlying data distribution for various users. In this paper, we study a personalized variant of the federated learning in which our goal is to find an initial shared model that current or new users can easily adapt to their local dataset by performing one or a few steps of gradient descent with respect to their own data. This approach keeps all the benefits of the federated learning architecture, and, by structure, leads to a more personalized model for each user. We show this problem can be studied within the Model-Agnostic Meta-Learning (MAML) framework. Inspired by this connection, we study a personalized variant of the well-known Federated Averaging algorithm and evaluate its performance in terms of gradient norm for non-convex loss functions. Further, we characterize how this performance is affected by the closeness of underlying distributions of user data, measured in terms of distribution distances such as Total Variation and 1-Wasserstein metric.

Submitted 22 October, 2020; v1 submitted 18 February, 2020; originally announced February 2020.

Comments: To appear in 34th Conference on Neural Information Processing Systems (NeurIPS 2020)

arXiv:2002.05135 [pdf, other]

On the Convergence Theory of Debiased Model-Agnostic Meta-Reinforcement Learning

Authors: Alireza Fallah, Kristian Georgiev, Aryan Mokhtari, Asuman Ozdaglar

Abstract: We consider Model-Agnostic Meta-Learning (MAML) methods for Reinforcement Learning (RL) problems, where the goal is to find a policy using data from several tasks represented by Markov Decision Processes (MDPs) that can be updated by one step of stochastic policy gradient for the realized MDP. In particular, using stochastic gradients in MAML update steps is crucial for RL problems since computation of exact gradients requires access to a large number of possible trajectories. For this formulation, we propose a variant of the MAML method, named Stochastic Gradient Meta-Reinforcement Learning (SG-MRL), and study its convergence properties. We derive the iteration and sample complexity of SG-MRL to find an $ε$-first-order stationary point, which, to the best of our knowledge, provides the first convergence guarantee for model-agnostic meta-reinforcement learning algorithms. We further show how our results extend to the case where more than one step of stochastic policy gradient method is used at test time. Finally, we empirically compare SG-MRL and MAML in several deep RL environments.

Submitted 16 November, 2021; v1 submitted 12 February, 2020; originally announced February 2020.

Comments: 35th Conference on Neural Information Processing Systems (NeurIPS 2021)

arXiv:2002.04766 [pdf, other]

Task-Robust Model-Agnostic Meta-Learning

Authors: Liam Collins, Aryan Mokhtari, Sanjay Shakkottai

Abstract: Meta-learning methods have shown an impressive ability to train models that rapidly learn new tasks. However, these methods only aim to perform well in expectation over tasks coming from some particular distribution that is typically equivalent across meta-training and meta-testing, rather than considering worst-case task performance. In this work we introduce the notion of "task-robustness" by reformulating the popular Model-Agnostic Meta-Learning (MAML) objective [Finn et al. 2017] such that the goal is to minimize the maximum loss over the observed meta-training tasks. The solution to this novel formulation is task-robust in the sense that it places equal importance on even the most difficult and/or rare tasks. This also means that it performs well over all distributions of the observed tasks, making it robust to shifts in the task distribution between meta-training and meta-testing. We present an algorithm to solve the proposed min-max problem, and show that it converges to an $ε$-accurate point at the optimal rate of $\mathcal{O}(1/ε^2)$ in the convex setting and to an $(ε, δ)$-stationary point at the rate of $\mathcal{O}(\max\{1/ε^5, 1/δ^5\})$ in nonconvex settings. We also provide an upper bound on the new task generalization error that captures the advantage of minimizing the worst-case task loss, and demonstrate this advantage in sinusoid regression and image classification experiments.

Submitted 18 June, 2020; v1 submitted 11 February, 2020; originally announced February 2020.

arXiv:1910.14380 [pdf, other]

A Decentralized Proximal Point-type Method for Saddle Point Problems

Authors: Weijie Liu, Aryan Mokhtari, Asuman Ozdaglar, Sarath Pattathil, Zebang Shen, Nenggan Zheng

Abstract: In this paper, we focus on solving a class of constrained non-convex non-concave saddle point problems in a decentralized manner by a group of nodes in a network. Specifically, we assume that each node has access to a summand of a global objective function and nodes are allowed to exchange information only with their neighboring nodes. We propose a decentralized variant of the proximal point method for solving this problem. We show that when the objective function is $ρ$-weakly convex-weakly concave the iterates converge to approximate stationarity with a rate of $\mathcal{O}(1/\sqrt{T})$ where the approximation error depends linearly on $\sqrt{ρ}$. We further show that when the objective function satisfies the Minty VI condition (which generalizes the convex-concave case) we obtain convergence to stationarity with a rate of $\mathcal{O}(1/\sqrt{T})$. To the best of our knowledge, our proposed method is the first decentralized algorithm with theoretical guarantees for solving a non-convex non-concave decentralized saddle point problem. Our numerical results for training a generative adversarial network (GAN) in a decentralized manner match our theoretical guarantees.

Submitted 31 October, 2019; originally announced October 2019.

Comments: 18 pages

arXiv:1910.04322 [pdf, other]

One Sample Stochastic Frank-Wolfe

Authors: Mingrui Zhang, Zebang Shen, Aryan Mokhtari, Hamed Hassani, Amin Karbasi

Abstract: One of the beauties of the projected gradient descent method lies in its rather simple mechanism and yet stable behavior with inexact, stochastic gradients, which has led to its wide-spread use in many machine learning applications. However, once we replace the projection operator with a simpler linear program, as is done in the Frank-Wolfe method, both simplicity and stability take a serious hit. The aim of this paper is to bring them back without sacrificing the efficiency. In this paper, we propose the first one-sample stochastic Frank-Wolfe algorithm, called 1-SFW, that avoids the need to carefully tune the batch size, step size, learning rate, and other complicated hyper parameters. In particular, 1-SFW achieves the optimal convergence rate of $\mathcal{O}(1/ε^2)$ for reaching an $ε$-suboptimal solution in the stochastic convex setting, and a $(1-1/e)-ε$ approximate solution for a stochastic monotone DR-submodular maximization problem. Moreover, in a general non-convex setting, 1-SFW finds an $ε$-first-order stationary point after at most $\mathcal{O}(1/ε^3)$ iterations, achieving the current best known convergence rate. All of this is possible by designing a novel unbiased momentum estimator that governs the stability of the optimization process while using a single sample at each iteration.

Submitted 9 October, 2019; originally announced October 2019.

arXiv:1909.13014 [pdf, other]

FedPAQ: A Communication-Efficient Federated Learning Method with Periodic Averaging and Quantization

Authors: Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, Ali Jadbabaie, Ramtin Pedarsani

Abstract: Federated learning is a distributed framework according to which a model is trained over a set of devices, while keeping data localized. This framework faces several systems-oriented challenges which include (i) communication bottleneck since a large number of devices upload their local updates to a parameter server, and (ii) scalability as the federated network consists of millions of devices. Due to these systems challenges as well as issues related to statistical heterogeneity of data and privacy concerns, designing a provably efficient federated learning method is of significant importance yet it remains challenging. In this paper, we present FedPAQ, a communication-efficient Federated Learning method with Periodic Averaging and Quantization. FedPAQ relies on three key features: (1) periodic averaging where models are updated locally at devices and only periodically averaged at the server; (2) partial device participation where only a fraction of devices participate in each round of the training; and (3) quantized message-passing where the edge nodes quantize their updates before uploading to the parameter server. These features address the communications and scalability challenges in federated learning. We also show that FedPAQ achieves near-optimal theoretical guarantees for strongly convex and non-convex loss functions and empirically demonstrate the communication-computation tradeoff provided by our method.

Submitted 7 June, 2020; v1 submitted 27 September, 2019; originally announced September 2019.

arXiv:1908.10400 [pdf, other]

On the Convergence Theory of Gradient-Based Model-Agnostic Meta-Learning Algorithms

Authors: Alireza Fallah, Aryan Mokhtari, Asuman Ozdaglar

Abstract: We study the convergence of a class of gradient-based Model-Agnostic Meta-Learning (MAML) methods and characterize their overall complexity as well as their best achievable accuracy in terms of gradient norm for nonconvex loss functions. We start with the MAML method and its first-order approximation (FO-MAML) and highlight the challenges that emerge in their analysis. By overcoming these challenges, we not only provide the first theoretical guarantees for MAML and FO-MAML in nonconvex settings, but also answer some of the unanswered questions for the implementation of these algorithms, including how to choose their learning rate and the batch size for both tasks and datasets corresponding to tasks. In particular, we show that MAML can find an $ε$-first-order stationary point ($ε$-FOSP) for any positive $ε$ after at most $\mathcal{O}(1/ε^2)$ iterations at the expense of requiring second-order information. We also show that FO-MAML which ignores the second-order information required in the update of MAML cannot achieve any small desired level of accuracy, i.e., FO-MAML cannot find an $ε$-FOSP for any $ε>0$. We further propose a new variant of the MAML algorithm called Hessian-free MAML which preserves all theoretical guarantees of MAML, without requiring access to second-order information.

Submitted 15 May, 2020; v1 submitted 27 August, 2019; originally announced August 2019.

Comments: To appear in the proceedings of the $23^{rd}$ International Conference on Artificial Intelligence and Statistics (AISTATS) 2020

arXiv:1907.10595 [pdf, other]

Robust and Communication-Efficient Collaborative Learning

Authors: Amirhossein Reisizadeh, Hossein Taheri, Aryan Mokhtari, Hamed Hassani, Ramtin Pedarsani

Abstract: We consider a decentralized learning problem, where a set of computing nodes aim at solving a non-convex optimization problem collaboratively. It is well-known that decentralized optimization schemes face two major system bottlenecks: stragglers' delay and communication overhead. In this paper, we tackle these bottlenecks by proposing a novel decentralized and gradient-based optimization algorithm named QuanTimed-DSGD. Our algorithm stands on two main ideas: (i) we impose a deadline on the local gradient computations of each node at each iteration of the algorithm, and (ii) the nodes exchange quantized versions of their local models. The first idea makes the algorithm robust to straggling nodes and the second alleviates the communication overhead. The key technical contribution of our work is to prove that with non-vanishing noises for quantization and stochastic gradients, the proposed method exactly converges to the global optimum for convex loss functions, and finds a first-order stationary point in non-convex scenarios. Our numerical evaluations of the QuanTimed-DSGD on training benchmark datasets, MNIST and CIFAR-10, demonstrate speedups of up to 3x in run-time, compared to state-of-the-art decentralized optimization methods.

Submitted 31 October, 2019; v1 submitted 24 July, 2019; originally announced July 2019.

arXiv:1906.01115 [pdf, ps, other]

Convergence Rate of $\mathcal{O}(1/k)$ for Optimistic Gradient and Extra-gradient Methods in Smooth Convex-Concave Saddle Point Problems

Authors: Aryan Mokhtari, Asuman Ozdaglar, Sarath Pattathil

Abstract: We study the iteration complexity of the optimistic gradient descent-ascent (OGDA) method and the extra-gradient (EG) method for finding a saddle point of a convex-concave unconstrained min-max problem. To do so, we first show that both OGDA and EG can be interpreted as approximate variants of the proximal point method. This is similar to the approach taken in [Nemirovski, 2004] which analyzes EG as an approximation of the `conceptual mirror prox'. In this paper, we highlight how gradients used in OGDA and EG try to approximate the gradient of the Proximal Point method. We then exploit this interpretation to show that both algorithms produce iterates that remain within a bounded set. We further show that the primal dual gap of the averaged iterates generated by both of these algorithms converges with a rate of $\mathcal{O}(1/k)$. Our theoretical analysis is of interest as it provides the first convergence rate estimate for OGDA in the general convex-concave setting. Moreover, it provides a simple convergence analysis for the EG algorithm in terms of function value without using compactness assumption.

Submitted 29 September, 2020; v1 submitted 3 June, 2019; originally announced June 2019.

Comments: 19 pages

arXiv:1906.00506 [pdf, ps, other]

DAve-QN: A Distributed Averaged Quasi-Newton Method with Local Superlinear Convergence Rate

Authors: Saeed Soori, Konstantin Mischenko, Aryan Mokhtari, Maryam Mehri Dehnavi, Mert Gurbuzbalaban

Abstract: In this paper, we consider distributed algorithms for solving the empirical risk minimization problem under the master/worker communication model. We develop a distributed asynchronous quasi-Newton algorithm that can achieve superlinear convergence. To our knowledge, this is the first distributed asynchronous algorithm with superlinear convergence guarantees. Our algorithm is communication-efficient in the sense that at every iteration the master node and workers communicate vectors of size $O(p)$, where $p$ is the dimension of the decision variable. The proposed method is based on a distributed asynchronous averaging scheme of decision vectors and gradients in a way to effectively capture the local Hessian information of the objective function. Our convergence theory supports asynchronous computations subject to both bounded delays and unbounded delays with a bounded time-average. Unlike in the majority of asynchronous optimization literature, we do not require choosing smaller stepsize when delays are huge. We provide numerical experiments that match our theoretical results and showcase significant improvement compared to state-of-the-art distributed algorithms.

Submitted 10 June, 2019; v1 submitted 2 June, 2019; originally announced June 2019.

arXiv:1902.07144 [pdf, ps, other]

doi 10.1016/j.jmps.2019.07.005

On the Properties of Phononic Eigenvalue Problems

Authors: Amir Ashkan Mokhtari, Yan Lu, Ankit Srivastava

Abstract: In this paper, we consider the operator properties of various phononic eigenvalue problems. We aim to answer some fundamental questions about the eigenvalues and eigenvectors of phononic operators. These include questions about the potential real and complex nature of the eigenvalues, whether the eigenvectors form a complete basis, what are the right orthogonality relationships, and how to create a complete basis when none may exist at the outset. In doing so we present a unified understanding of the properties of the phononic eigenvalues and eigenvectors which would emerge from any numerical method employed to compute such quantities. We show that the phononic problem can be cast into linear eigenvalue forms from which such quantities as frequencies, wavenumbers, and desired components of wavevectors can be directly ascertained without resorting to searches or quadratic eigenvalue problems and that the relevant properties of such quantities can be determined a priori through the analysis of the associated operators. We further show how the Plane Wave Expansion (PWE) method may be extended to solve each of these eigenvalue forms, thus extending the applicability of the PWE method to cases beyond those which have been considered till now. The theoretical discussions are supplemented with supporting numerical calculations. The techniques and results presented here directly apply to wave propagation in other periodic systems such as photonics.

Submitted 7 July, 2019; v1 submitted 16 February, 2019; originally announced February 2019.

Journal ref: Journal of the Mechanics and Physics of Solids, 2019

arXiv:1902.06992 [pdf, other]

Stochastic Conditional Gradient++

Authors: Hamed Hassani, Amin Karbasi, Aryan Mokhtari, Zebang Shen

Abstract: In this paper, we consider the general non-oblivious stochastic optimization where the underlying stochasticity may change during the optimization procedure and depends on the point at which the function is evaluated. We develop Stochastic Frank-Wolfe++ ($\text{SFW}{++}$), an efficient variant of the conditional gradient method for minimizing a smooth non-convex function subject to a convex body constraint. We show that $\text{SFW}{++}$ converges to an $ε$-first order stationary point by using $O(1/ε^3)$ stochastic gradients. Once further structures are present, $\text{SFW}{++}$'s theoretical guarantees, in terms of the convergence rate and quality of its solution, improve. In particular, for minimizing a convex function, $\text{SFW}{++}$ achieves an $ε$-approximate optimum while using $O(1/ε^2)$ stochastic gradients. It is known that this rate is optimal in terms of stochastic gradient evaluations. Similarly, for maximizing a monotone continuous DR-submodular function, a slightly different form of $\text{SFW}{++}$, called Stochastic Continuous Greedy++ ($\text{SCG}{++}$), achieves a tight $[(1-1/e)\text{OPT} -ε]$ solution while using $O(1/ε^2)$ stochastic gradients. Through an information theoretic argument, we also prove that $\text{SCG}{++}$'s convergence rate is optimal. Finally, for maximizing a non-monotone continuous DR-submodular function, we can achieve a $[(1/e)\text{OPT} -ε]$ solution by using $O(1/ε^2)$ stochastic gradients.
We should highlight that our results and our novel variance reduction technique trivially extend to the standard and easier oblivious stochastic optimization settings for (non-)convex and continuous submodular settings.

Submitted 8 September, 2020; v1 submitted 19 February, 2019; originally announced February 2019.

arXiv:1902.06332 [pdf, other]

Quantized Frank-Wolfe: Faster Optimization, Lower Communication, and Projection Free

Authors: Mingrui Zhang, Lin Chen, Aryan Mokhtari, Hamed Hassani, Amin Karbasi

Abstract: How can we efficiently mitigate the overhead of gradient communications in distributed optimization? This problem is at the heart of training scalable machine learning models and has been mainly studied in the unconstrained setting. In this paper, we propose Quantized-Frank-Wolfe (QFW), the first projection-free and communication-efficient algorithm for solving constrained optimization problems at scale. We consider both convex and non-convex objective functions, expressed as a finite-sum or more generally a stochastic optimization problem, and provide strong theoretical guarantees on the convergence rate of QFW. This is accomplished by proposing novel quantization schemes that efficiently compress gradients while controlling the noise variance introduced during this process. Finally, we empirically validate the efficiency of QFW in terms of communication and the quality of returned solution against natural baselines.

Submitted 30 May, 2019; v1 submitted 17 February, 2019; originally announced February 2019.

arXiv:1901.08511 [pdf, ps, other]

A Unified Analysis of Extra-gradient and Optimistic Gradient Methods for Saddle Point Problems: Proximal Point Approach

Authors: Aryan Mokhtari, Asuman Ozdaglar, Sarath Pattathil

Abstract: In this paper we consider solving saddle point problems using two variants of Gradient Descent-Ascent algorithms, Extra-gradient (EG) and Optimistic Gradient Descent Ascent (OGDA) methods. We show that both of these algorithms admit a unified analysis as approximations of the classical proximal point method for solving saddle point problems. This viewpoint enables us to develop a new framework for analyzing EG and OGDA for bilinear and strongly convex-strongly concave settings. Moreover, we use the proximal point approximation interpretation to generalize the results for OGDA for a wide range of parameters.

Submitted 5 September, 2019; v1 submitted 24 January, 2019; originally announced January 2019.

Comments: 25 pages, 3 figures

arXiv:1811.04365 [pdf, ps, other]

doi 10.1016/j.jmps.2019.02.016

On the Emergence of Negative Effective Density and Modulus in 2-phase Phononic Crystals

Authors: Amir Ashkan Mokhtari, Yan Lu, Ankit Srivastava

Abstract: In this paper we report metamaterial properties including negative and singular effective properties for what would traditionally be considered non locally resonant 2-phase phononic unit cells. The negative effective material properties reported here occur well below the homogenization limit and are, therefore, acceptable descriptions of overall behavior. The material property combinations which make this possible were first revealed by a novel level set based topology optimization process which we describe. The optimization process revealed that a 2-phase unit cell in which one of the phases is simultaneously lighter and stiffer than the other results in dynamic behavior which has all the attendant characteristics of a locally resonant composite including negative effective properties far below the homogenization limit. We investigate this further using the Craig-Bampton decomposition and clarify that these properties emerge through an interplay between the fundamental internal modeshape of the unit cell and a rigid body mode. Through explicit numerical calculations on 1-D, 2-phase unit cells, we show that negative effective properties only appear for the specific material property combination mentioned above. Furthermore, we provide a proof which supports this conclusion. The concept is also shown to hold for 2-D unit cells where we show that an appropriately designed hexagonal unit cell made of 2 material phases exhibits negative effective shear modulus and density in an appropriate frequency regime in which it also exhibits negative refraction.
An important conclusion of this paper is that the class of unit cells expected to result in negative properties can be expanded beyond the classic unit cell (three-phase unit cells with an explicit locally resonant phase) to include topologically simpler 2-phase unit cells as well.

Submitted 7 January, 2019; v1 submitted 11 November, 2018; originally announced November 2018.

arXiv:1811.02521 [pdf, ps, other]

Achieving Acceleration in Distributed Optimization via Direct Discretization of the Heavy-Ball ODE

Authors: Jingzhao Zhang, César A. Uribe, Aryan Mokhtari, Ali Jadbabaie

Abstract: We develop a distributed algorithm for convex Empirical Risk Minimization, the problem of minimizing a large but finite sum of convex functions over networks. The proposed algorithm is derived from directly discretizing the second-order heavy-ball differential equation and results in an accelerated convergence rate, i.e., faster than distributed gradient descent-based methods for strongly convex objectives that may not be smooth. Notably, we achieve acceleration without resorting to the well-known Nesterov's momentum approach. We provide numerical experiments and contrast the proposed method with recently proposed optimal distributed optimization algorithms.

Submitted 6 November, 2018; originally announced November 2018.

arXiv:1810.11507 [pdf, other]

Efficient Distributed Hessian Free Algorithm for Large-scale Empirical Risk Minimization via Accumulating Sample Strategy

Authors: Majid Jahani, Xi He, Chenxin Ma, Aryan Mokhtari, Dheevatsa Mudigere, Alejandro Ribeiro, Martin Takáč

Abstract: In this paper, we propose a Distributed Accumulated Newton Conjugate gradiEnt (DANCE) method in which the sample size is gradually increased to quickly obtain a solution whose empirical loss is under satisfactory statistical accuracy. Our proposed method is multistage: the solution of a stage serves as a warm start for the next stage, which contains more samples (including the samples in the previous stage). The proposed multistage algorithm reduces the number of passes over data to achieve the statistical accuracy of the full training set. Moreover, our algorithm is by nature easy to distribute and shares the strong scaling property, indicating that acceleration is always expected by using more computing nodes. Various iteration complexity results regarding descent direction computation, communication efficiency and stopping criteria are analyzed under the convex setting. Our numerical results illustrate that the proposed method outperforms other comparable methods for solving learning problems including neural networks.

Submitted 9 March, 2020; v1 submitted 26 October, 2018; originally announced October 2018.

Comments: Updated numerical results

arXiv:1809.02162 [pdf, ps, other]

Escaping Saddle Points in Constrained Optimization

Authors: Aryan Mokhtari, Asuman Ozdaglar, Ali Jadbabaie

Abstract: In this paper, we study the problem of escaping from saddle points in smooth nonconvex optimization problems subject to a convex set $\mathcal{C}$. We propose a generic framework that yields convergence to a second-order stationary point of the problem, if the convex set $\mathcal{C}$ is simple for a quadratic objective function. Specifically, our results hold if one can find a $ρ$-approximate solution of a quadratic program subject to $\mathcal{C}$ in polynomial time, where $ρ<1$ is a positive constant that depends on the structure of the set $\mathcal{C}$. Under this condition, we show that the sequence of iterates generated by the proposed framework reaches an $(ε,γ)$-second order stationary point (SOSP) in at most $\mathcal{O}(\max\{ε^{-2},ρ^{-3}γ^{-3}\})$ iterations. We further characterize the overall complexity of reaching an SOSP when the convex set $\mathcal{C}$ can be written as a set of quadratic constraints and the objective function Hessian has a specific structure over the convex set $\mathcal{C}$. Finally, we extend our results to the stochastic setting and characterize the number of stochastic gradient and Hessian evaluations to reach an $(ε,γ)$-SOSP.

Submitted 9 October, 2018; v1 submitted 6 September, 2018; originally announced September 2018.

arXiv:1809.01212 [pdf, other]

doi 10.1109/TSP.2019.2951216

A Primal-Dual Quasi-Newton Method for Exact Consensus Optimization

Authors: Mark Eisen, Aryan Mokhtari, Alejandro Ribeiro

Abstract: We introduce the primal-dual quasi-Newton (PD-QN) method as an approximated second order method for solving decentralized optimization problems. The PD-QN method performs quasi-Newton updates on both the primal and dual variables of the consensus optimization problem to find the optimal point of the augmented Lagrangian. By optimizing the augmented Lagrangian, the PD-QN method is able to find the exact solution to the consensus problem with a linear rate of convergence. We derive fully decentralized quasi-Newton updates that approximate second order information to reduce the computational burden relative to dual methods and to make the method more robust in ill-conditioned problems relative to first order methods. The linear convergence rate of PD-QN is established formally and strong performance advantages relative to existing dual and primal-dual methods are shown numerically.

Submitted 10 July, 2019; v1 submitted 4 September, 2018; originally announced September 2018.

arXiv:1808.02269 [pdf]

doi 10.1088/1367-2630/18/11/113040

Enhanced magnetic properties in ZnCoAlO caused by exchange-coupling to Co nanoparticles

Authors: Qi Feng, Wala Dizayee, Xiaoli Li, David S Score, James R Neal, Anthony J Behan, Abbas Mokhtari, Marzook S Alshammari, Mohammed S Al-Qahtani, Harry J Blythe, Roy W Chantrell, Steve M Heald, Xiao-Hong Xu, A Mark Fox, Gillian A Gehring

Abstract: We report the results of a sequence of magnetisation and magneto-optical studies on laser ablated thin films of ZnCoAlO and ZnCoO that contain a small amount of metallic cobalt. The results are compared to those expected when all the magnetization is due to isolated metallic clusters of cobalt and with an oxide sample that is almost free from metallic inclusions. Using a variety of direct magnetic measurements and also magnetic circular dichroism we find that there is ferromagnetism within both the oxide and the metallic inclusions, and furthermore that these magnetic components are exchange-coupled when aluminium is included. This enhances both the coercive field and the remanence. Hence the presence of a controlled quantity of metallic nanoparticles in ZnAlO can improve the magnetic response of the oxide, thus giving great advantages for applications in spintronics.

Submitted 7 August, 2018; originally announced August 2018.

Comments: 13 pages, 6 figures

Journal ref: New J. Phys. 18 (2016) 113040

arXiv:1806.11536 [pdf, other]

doi 10.1109/TSP.2019.2932876

An Exact Quantized Decentralized Gradient Descent Algorithm

Authors: Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, Ramtin Pedarsani

Abstract: We consider the problem of decentralized consensus optimization, where the sum of $n$ smooth and strongly convex functions is minimized over $n$ distributed agents that form a connected network. In particular, we consider the case that the communicated local decision variables among nodes are quantized in order to alleviate the communication bottleneck in distributed optimization. We propose the Quantized Decentralized Gradient Descent (QDGD) algorithm, in which nodes update their local decision variables by combining the quantized information received from their neighbors with their local information. We prove that under standard strong convexity and smoothness assumptions for the objective function, QDGD achieves a vanishing mean solution error under customary conditions for quantizers. To the best of our knowledge, this is the first algorithm that achieves vanishing consensus error in the presence of quantization noise. Moreover, we provide simulation results that show tight agreement between our derived theoretical convergence rate and the numerical results.

Submitted 1 August, 2019; v1 submitted 29 June, 2018; originally announced June 2018.

arXiv:1805.09969 [pdf, ps, other]

Towards More Efficient Stochastic Decentralized Learning: Faster Convergence and Sparse Communication

Authors: Zebang Shen, Aryan Mokhtari, Tengfei Zhou, Peilin Zhao, Hui Qian

Abstract: Recently, the decentralized optimization problem has been attracting growing attention. Most existing methods are deterministic with high per-iteration cost and have a convergence rate quadratically depending on the problem condition number. Besides, dense communication is necessary to ensure convergence even if the dataset is sparse. In this paper, we generalize the decentralized optimization problem to a monotone operator root finding problem, and propose a stochastic algorithm named DSBA that (i) converges geometrically with a rate linearly depending on the problem condition number, and (ii) can be implemented using sparse communication only. Additionally, DSBA handles learning problems like AUC-maximization which cannot be tackled efficiently in the decentralized setting. Experiments on convex minimization and AUC-maximization validate the efficiency of our method.

Submitted 24 May, 2018; originally announced May 2018.

Comments: Accepted to ICML 2018

arXiv:1805.00521 [pdf, other]

Direct Runge-Kutta Discretization Achieves Acceleration

Authors: Jingzhao Zhang, Aryan Mokhtari, Suvrit Sra, Ali Jadbabaie

Abstract: We study gradient-based optimization methods obtained by directly discretizing a second-order ordinary differential equation (ODE) related to the continuous limit of Nesterov's accelerated gradient method. When the function is smooth enough, we show that acceleration can be achieved by a stable discretization of this ODE using standard Runge-Kutta integrators. Specifically, we prove that under Lipschitz-gradient, convexity and order-$(s+2)$ differentiability assumptions, the sequence of iterates generated by discretizing the proposed second-order ODE converges to the optimal solution at a rate of $\mathcal{O}({N^{-2\frac{s}{s+1}}})$, where $s$ is the order of the Runge-Kutta numerical integrator. Furthermore, we introduce a new local flatness condition on the objective, under which rates even faster than $\mathcal{O}(N^{-2})$ can be achieved with low-order integrators and only gradient information. Notably, this flatness condition is satisfied by several standard loss functions used in machine learning. We provide numerical experiments that verify the theoretical rates predicted by our results.

Submitted 27 November, 2018; v1 submitted 1 May, 2018; originally announced May 2018.

Comments: 24 pages, 4 figures

arXiv:1804.09554 [pdf, other]

Stochastic Conditional Gradient Methods: From Convex Minimization to Submodular Maximization

Authors: Aryan Mokhtari, Hamed Hassani, Amin Karbasi

Abstract: This paper considers stochastic optimization problems for a large class of objective functions, including convex and continuous submodular. Stochastic proximal gradient methods have been widely used to solve such problems; however, their applicability remains limited when the problem dimension is large and the projection onto a convex set is costly. Instead, stochastic conditional gradient methods are proposed as an alternative solution relying on (i) approximating gradients via a simple averaging technique requiring a single stochastic gradient evaluation per iteration; (ii) solving a linear program to compute the descent/ascent direction. The averaging technique reduces the noise of gradient approximations as time progresses, and replacing the projection step in proximal methods by a linear program lowers the computational complexity of each iteration. We show that under convexity and smoothness assumptions, our proposed method converges to the optimal objective function value at a sublinear rate of $O(1/t^{1/3})$. Further, for a monotone and continuous DR-submodular function and subject to a general convex body constraint, we prove that our proposed method achieves a $((1-1/e)OPT-\epsilon)$ guarantee with $O(1/\epsilon^3)$ stochastic gradient computations. This guarantee matches the known hardness results and closes the gap between deterministic and stochastic continuous submodular maximization.
Additionally, we obtain a $((1/e)OPT-\epsilon)$ guarantee after using $O(1/\epsilon^3)$ stochastic gradients for the case that the objective function is continuous DR-submodular but non-monotone and the constraint set is down-closed. By using stochastic continuous optimization as an interface, we provide the first $(1-1/e)$ tight approximation guarantee for maximizing a monotone but stochastic submodular set function subject to a matroid constraint and a $(1/e)$ approximation guarantee for the non-monotone case.

Submitted 12 November, 2018; v1 submitted 24 April, 2018; originally announced April 2018.

Comments: arXiv admin note: text overlap with arXiv:1711.01660

arXiv:1802.03825 [pdf, other]

Decentralized Submodular Maximization: Bridging Discrete and Continuous Settings

Authors: Aryan Mokhtari, Hamed Hassani, Amin Karbasi

Abstract: In this paper, we showcase the interplay between discrete and continuous optimization in network-structured settings. We propose the first fully decentralized optimization method for a wide class of non-convex objective functions that possess a diminishing returns property. More specifically, given an arbitrary connected network and a global continuous submodular function, formed by a sum of local functions, we develop Decentralized Continuous Greedy (DCG), a message passing algorithm that converges to the tight (1-1/e) approximation factor of the optimum global solution using only local computation and communication. We also provide strong convergence bounds as a function of network size and spectral characteristics of the underlying topology. Interestingly, DCG readily provides a simple recipe for decentralized discrete submodular maximization through the means of continuous relaxations. Formally, we demonstrate that by lifting the local discrete functions to continuous domains and using DCG as an interface we can develop a consensus algorithm that also achieves the tight (1-1/e) approximation guarantee of the global discrete solution once a proper rounding scheme is applied.

Submitted 11 February, 2018; originally announced February 2018.

arXiv:1711.01660 [pdf, other]

Conditional Gradient Method for Stochastic Submodular Maximization: Closing the Gap

Authors: Aryan Mokhtari, Hamed Hassani, Amin Karbasi

Abstract: In this paper, we study the problem of \textit{constrained} and \textit{stochastic} continuous submodular maximization. Even though the objective function is not concave (nor convex) and is defined in terms of an expectation, we develop a variant of the conditional gradient method, called \alg, which achieves a \textit{tight} approximation guarantee. More precisely, for a monotone and continuous DR-submodular function and subject to a \textit{general} convex body constraint, we prove that \alg achieves a $[(1-1/e)\text{OPT} -\eps]$ guarantee (in expectation) with $\mathcal{O}{(1/\eps^3)}$ stochastic gradient computations. This guarantee matches the known hardness results and closes the gap between deterministic and stochastic continuous submodular maximization. By using stochastic continuous optimization as an interface, we also provide the first $(1-1/e)$ tight approximation guarantee for maximizing a \textit{monotone but stochastic} submodular \textit{set} function subject to a general matroid constraint.

Submitted 5 November, 2017; originally announced November 2017.

arXiv:1710.03738 [pdf, ps, other]

doi 10.1016/j.physletb.2018.09.020

Diffusivities bounds in the presence of Weyl corrections

Authors: Ali Mokhtari, Seyed Ali Hosseini Mansoori, Kazem Bitaghsir Fadafan

Abstract: In this paper, we investigate the behavior of the thermoelectric DC conductivities in the presence of Weyl corrections with momentum dissipation in the incoherent limit. Moreover, we compute the butterfly velocity and study the charge and energy diffusion with broken translational symmetry. Our results show that the Weyl coupling $γ$ violates the bounds on the charge and energy diffusivity. It is also shown that the Weyl corrections violate the bound on the DC electrical conductivity in the incoherent limit.

Submitted 24 September, 2018; v1 submitted 10 October, 2017; originally announced October 2017.

Comments: v4: The appendix D and E were added

arXiv:1709.00599 [pdf, other]

First-Order Adaptive Sample Size Methods to Reduce Complexity of Empirical Risk Minimization

Authors: Aryan Mokhtari, Alejandro Ribeiro

Abstract: This paper studies empirical risk minimization (ERM) problems for large-scale datasets and incorporates the idea of adaptive sample size methods to improve the guaranteed convergence bounds for first-order stochastic and deterministic methods. In contrast to traditional methods that attempt to solve the ERM problem corresponding to the full dataset directly, adaptive sample size schemes start with a small number of samples and solve the corresponding ERM problem to its statistical accuracy. The sample size is then grown geometrically (e.g., scaling by a factor of two), and the solution of the previous ERM is used as a warm start for the new ERM. Theoretical analyses show that the use of adaptive sample size methods reduces the overall computational cost of achieving the statistical accuracy of the whole dataset for a broad range of deterministic and stochastic first-order methods. The gains are specific to the choice of method. When particularized to, e.g., accelerated gradient descent and stochastic variance reduced gradient, the computational cost advantage is a logarithm of the number of training samples. Numerical experiments on various datasets confirm theoretical claims and showcase the gains of using the proposed adaptive sample size scheme.

Submitted 2 September, 2017; originally announced September 2017.

arXiv:1707.08028 [pdf, ps, other]

A Newton-Based Method for Nonconvex Optimization with Fast Evasion of Saddle Points

Authors: Santiago Paternain, Aryan Mokhtari, Alejandro Ribeiro

Abstract: Machine learning problems such as neural network training, tensor decomposition, and matrix factorization require local minimization of a nonconvex function. This local minimization is challenged by the presence of saddle points, of which there can be many and from which descent methods may take an inordinately large number of iterations to escape. This paper presents a second-order method that modifies the update of Newton's method by replacing the negative eigenvalues of the Hessian by their absolute values and uses a truncated version of the resulting matrix to account for the objective's curvature. The method is shown to escape saddles in at most $1 + \log_{3/2} (δ/2\varepsilon)$ iterations where $\varepsilon$ is the target optimality and $δ$ characterizes a point sufficiently far away from the saddle. The base of this exponential escape is $3/2$ independently of problem constants. Adding classical properties of Newton's method, the paper proves convergence to a local minimum with probability $1-p$ in $O(\log(1/p)) + O(\log(1/\varepsilon))$ iterations.

Submitted 20 July, 2018; v1 submitted 25 July, 2017; originally announced July 2017.

arXiv:1705.07957 [pdf, ps, other]

Large Scale Empirical Risk Minimization via Truncated Adaptive Newton Method

Authors: Mark Eisen, Aryan Mokhtari, Alejandro Ribeiro

Abstract: We consider large scale empirical risk minimization (ERM) problems, where both the problem dimension and variable size are large. In these cases, most second order methods are infeasible due to the high cost in both computing the Hessian over all samples and computing its inverse in high dimensions. In this paper, we propose a novel adaptive sample size second-order method, which reduces the cost of computing the Hessian by solving a sequence of ERM problems corresponding to a subset of samples and lowers the cost of computing the Hessian inverse using a truncated eigenvalue decomposition. We show that while we geometrically increase the size of the training set at each stage, a single iteration of the truncated Newton method is sufficient to solve the new ERM within its statistical accuracy. Moreover, for a large number of samples we are allowed to double the size of the training set at each stage, and the proposed method subsequently reaches the statistical accuracy of the full training set approximately after two effective passes. In addition to this theoretical result, we show empirically on a number of well known data sets that the proposed truncated adaptive sample size algorithm outperforms stochastic alternatives for solving ERM problems.

Submitted 22 May, 2017; originally announced May 2017.

arXiv:1702.00709 [pdf, other]

IQN: An Incremental Quasi-Newton Method with Local Superlinear Convergence Rate

Authors: Aryan Mokhtari, Mark Eisen, Alejandro Ribeiro

Abstract: The problem of minimizing an objective that can be written as the sum of a set of $n$ smooth and strongly convex functions is considered. The Incremental Quasi-Newton (IQN) method proposed here belongs to the family of stochastic and incremental methods that have a cost per iteration independent of $n$. IQN iterations are a stochastic version of BFGS iterations that use memory to reduce the variance of stochastic approximations. The convergence properties of IQN bridge a gap between deterministic and stochastic quasi-Newton methods. Deterministic quasi-Newton methods exploit the possibility of approximating the Newton step using objective gradient differences. They are appealing because they have a smaller computational cost per iteration relative to Newton's method and achieve a superlinear convergence rate under customary regularity assumptions. Stochastic quasi-Newton methods utilize stochastic gradient differences in lieu of actual gradient differences. This makes their computational cost per iteration independent of the number of objective functions $n$. However, existing stochastic quasi-Newton methods have sublinear or linear convergence at best. IQN is the first stochastic quasi-Newton method proven to converge superlinearly in a local neighborhood of the optimal solution. IQN differs from state-of-the-art incremental quasi-Newton methods in three aspects: (i) The use of aggregated information of variables, gradients, and quasi-Newton Hessian approximation matrices to reduce the noise of gradient and Hessian approximations.
(ii) The approximation of each individual function by its Taylor's expansion in which the linear and quadratic terms are evaluated with respect to the same iterate. (iii) The use of a cyclic scheme to update the functions in lieu of a random selection routine. We use these fundamental properties of IQN to establish its local superlinear convergence rate.

Submitted 27 March, 2017; v1 submitted 2 February, 2017; originally announced February 2017.

arXiv:1611.00347 [pdf, other]

Surpassing Gradient Descent Provably: A Cyclic Incremental Method with Linear Convergence Rate

Authors: Aryan Mokhtari, Mert Gürbüzbalaban, Alejandro Ribeiro

Abstract: Recently, there has been growing interest in developing optimization methods for solving large-scale machine learning problems. Most of these problems boil down to the problem of minimizing an average of a finite set of smooth and strongly convex functions where the number of functions $n$ is large. The gradient descent method (GD) is successful in minimizing convex problems at a fast linear rate; however, it is not applicable to the considered large-scale optimization setting because of its high computational complexity. Incremental methods resolve this drawback of gradient methods by replacing the required gradient for the descent direction with an incremental gradient approximation. They operate by evaluating one gradient per iteration and using the average of the $n$ available gradients as a gradient approximation. Although incremental methods reduce the computational cost of GD, their convergence rates do not justify their advantage relative to GD in terms of the total number of gradient evaluations until convergence. In this paper, we introduce a Double Incremental Aggregated Gradient method (DIAG) that computes the gradient of only one function at each iteration, which is chosen based on a cyclic scheme, and uses the aggregated average gradient of all the functions to approximate the full gradient. The iterates of the proposed DIAG method use averages of both iterates and gradients, as opposed to classic incremental methods that utilize gradient averages but do not utilize iterate averages. We prove not only that the proposed DIAG method converges linearly to the optimal solution, but also that its linear convergence factor justifies the advantage of incremental methods over GD. In particular, we prove that the worst case performance of DIAG is better than the worst case performance of GD.
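The cyclic update described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the scalar quadratic test functions, the function name `diag_minimize`, and the step size 2/(mu + L) are assumptions made here for illustration. The sketch keeps one stored iterate and one stored gradient per function, forms the new point from the averages of both, and refreshes a single entry per iteration in cyclic order.

```python
def diag_minimize(a, b, x0=0.0, num_iters=200):
    """DIAG-style cyclic iteration on f_i(x) = 0.5 * a[i] * (x - b[i])**2.

    Hypothetical helper for illustration; a[i] > 0 is the curvature of f_i.
    """
    n = len(a)
    mu, big_l = min(a), max(a)      # strong convexity and smoothness constants
    step = 2.0 / (mu + big_l)       # assumed step size, not taken from the abstract
    y = [x0] * n                    # stored iterate for each function
    g = [a[i] * (x0 - b[i]) for i in range(n)]  # stored gradient for each function
    x = x0
    for k in range(num_iters):
        # New point: average of stored iterates minus a step along the
        # average of stored gradients.
        x = sum(y) / n - step * sum(g) / n
        # Cyclic scheme: refresh exactly one function's memory per iteration.
        i = k % n
        y[i] = x
        g[i] = a[i] * (x - b[i])
    return x
```

For these quadratics the minimizer of the average is the curvature-weighted mean of the `b[i]`, which gives a simple check: `diag_minimize([1.0, 2.0], [0.0, 3.0])` should approach 2.0.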

Submitted 7 February, 2018; v1 submitted 1 November, 2016; originally announced November 2016.

arXiv:1610.02143 [pdf, other]

doi 10.1109/TSP.2017.2679690

Stochastic Averaging for Constrained Optimization with Application to Online Resource Allocation

Authors: Tianyi Chen, Aryan Mokhtari, Xin Wang, Alejandro Ribeiro, Georgios B. Giannakis

Abstract: Existing approaches to resource allocation for today's stochastic networks are challenged to meet fast convergence and tolerable delay requirements. The present paper leverages online learning advances to facilitate stochastic resource allocation tasks. By recognizing the central role of Lagrange multipliers, the underlying constrained optimization problem is formulated as a machine learning task involving both training and operational modes, with the goal of learning the sought multipliers in a fast and efficient manner. To this end, an order-optimal offline learning approach is developed first for batch training, and it is then generalized to the online setting with a procedure termed learn-and-adapt. The novel resource allocation protocol permeates benefits of stochastic approximation and statistical learning to obtain low-complexity online updates with learning errors close to the statistical accuracy limits, while still preserving adaptation performance, which in the stochastic network optimization context guarantees queue stability. Analysis and simulated tests demonstrate that the proposed data-driven approach improves the delay and convergence performance of existing resource allocation schemes.

Submitted 26 February, 2017; v1 submitted 7 October, 2016; originally announced October 2016.

arXiv:1606.04991 [pdf, other]

A Class of Parallel Doubly Stochastic Algorithms for Large-Scale Learning

Authors: Aryan Mokhtari, Alec Koppel, Alejandro Ribeiro

Abstract: We consider learning problems over training sets in which both the number of training examples and the dimension of the feature vectors are large. To solve these problems we propose the random parallel stochastic algorithm (RAPSA). We call the algorithm random parallel because it utilizes multiple parallel processors to operate on a randomly chosen subset of blocks of the feature vector. We call the algorithm stochastic because processors choose training subsets uniformly at random. Algorithms that are parallel in either of these dimensions exist, but RAPSA is the first attempt at a methodology that is parallel in both the selection of blocks and the selection of elements of the training set. In RAPSA, processors utilize the randomly chosen functions to compute the stochastic gradient component associated with a randomly chosen block. The technical contribution of this paper is to show that this minimally coordinated algorithm converges to the optimal classifier when the training objective is convex. Moreover, we present an accelerated version of RAPSA (ARAPSA) that incorporates the objective function curvature information by premultiplying the descent direction by a Hessian approximation matrix. We further extend the results for asynchronous settings and show that if the processors perform their updates without any coordination the algorithms are still convergent to the optimal argument. RAPSA and its extensions are then numerically evaluated on a linear estimation problem and a binary image classification task using the MNIST handwritten digit dataset.

Submitted 15 June, 2016; originally announced June 2016.

Comments: arXiv admin note: substantial text overlap with arXiv:1603.06782

arXiv:1605.07659 [pdf, other]

Adaptive Newton Method for Empirical Risk Minimization to Statistical Accuracy

Authors: Aryan Mokhtari, Alejandro Ribeiro

Abstract: We consider empirical risk minimization for large-scale datasets. We introduce Ada Newton as an adaptive algorithm that uses Newton's method with adaptive sample sizes. The main idea of Ada Newton is to increase the size of the training set by a factor larger than one in a way that the minimization variable for the current training set is in the local neighborhood of the optimal argument of the next training set. This makes it possible to exploit the quadratic convergence property of Newton's method and reach the statistical accuracy of each training set with only one iteration of Newton's method. We show theoretically and empirically that Ada Newton can double the size of the training set in each iteration to achieve the statistical accuracy of the full training set with about two passes over the dataset.

Submitted 24 May, 2016; originally announced May 2016.

arXiv:1605.00933 [pdf, ps, other]

doi 10.1109/TSP.2017.2666776

Decentralized Quasi-Newton Methods

Authors: Mark Eisen, Aryan Mokhtari, Alejandro Ribeiro

Abstract: We introduce the decentralized Broyden-Fletcher-Goldfarb-Shanno (D-BFGS) method as a variation of the BFGS quasi-Newton method for solving decentralized optimization problems. The D-BFGS method is of interest in problems that are not well conditioned, making first order decentralized methods ineffective, and in which second order information is not readily available, making second order decentralized methods impossible. D-BFGS is a fully distributed algorithm in which nodes approximate curvature information of themselves and their neighbors through the satisfaction of a secant condition. We additionally provide a formulation of the algorithm in asynchronous settings. Convergence of D-BFGS is established formally in both the synchronous and asynchronous settings and strong performance advantages relative to first order methods are shown numerically.

Submitted 3 May, 2016; originally announced May 2016.

arXiv:1603.08094 [pdf, other]

A Decentralized Second-Order Method for Dynamic Optimization

Authors: Aryan Mokhtari, Wei Shi, Qing Ling, Alejandro Ribeiro

Abstract: This paper considers decentralized dynamic optimization problems where nodes of a network try to minimize a sequence of time-varying objective functions in a real-time scheme. At each time slot, nodes have access to different summands of an instantaneous global objective function and they are allowed to exchange information only with their neighbors. This paper develops the application of the Exact Second-Order Method (ESOM) to solve the dynamic optimization problem in a decentralized manner. The proposed dynamic ESOM algorithm operates by primal descending and dual ascending on a quadratic approximation of an augmented Lagrangian of the instantaneous consensus optimization problem. The convergence analysis of dynamic ESOM indicates that a Lyapunov function of the sequence of primal and dual errors converges linearly to an error bound when the local functions are strongly convex and have Lipschitz continuous gradients. Numerical results demonstrate the claim that the sequence of iterates generated by the proposed method is able to track the sequence of optimal arguments.

Submitted 26 March, 2016; originally announced March 2016.

arXiv:1603.07195 [pdf, other]

A Decentralized Quasi-Newton Method for Dual Formulations of Consensus Optimization

Authors: Mark Eisen, Aryan Mokhtari, Alejandro Ribeiro

Abstract: This paper considers consensus optimization problems where each node of a network has access to a different summand of an aggregate cost function. Nodes try to minimize the aggregate cost function, while they exchange information only with their neighbors. We modify the dual decomposition method to incorporate a curvature correction inspired by the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton method. The resulting dual D-BFGS method is a fully decentralized algorithm in which nodes approximate curvature information of themselves and their neighbors through the satisfaction of a secant condition. Dual D-BFGS is of interest in consensus optimization problems that are not well conditioned, making first order decentralized methods ineffective, and in which second order information is not readily available, making decentralized second order methods infeasible. Asynchronous implementation is discussed and convergence of D-BFGS is established formally for both synchronous and asynchronous implementations. Performance advantages relative to alternative decentralized algorithms are shown numerically.

Submitted 23 March, 2016; originally announced March 2016.

Comments: 8 pages

arXiv:1603.06782 [pdf, other]

Doubly Random Parallel Stochastic Methods for Large Scale Learning

Authors: Aryan Mokhtari, Alec Koppel, Alejandro Ribeiro

Abstract: We consider learning problems over training sets in which both the number of training examples and the dimension of the feature vectors are large. To solve these problems we propose the random parallel stochastic algorithm (RAPSA). We call the algorithm random parallel because it utilizes multiple processors to operate in a randomly chosen subset of blocks of the feature vector. We call the algorithm parallel stochastic because processors choose elements of the training set randomly and independently. Algorithms that are parallel in either of these dimensions exist, but RAPSA is the first attempt at a methodology that is parallel in both the selection of blocks and the selection of elements of the training set. In RAPSA, processors utilize the randomly chosen functions to compute the stochastic gradient component associated with a randomly chosen block. The technical contribution of this paper is to show that this minimally coordinated algorithm converges to the optimal classifier when the training objective is convex. In particular, we show that: (i) When using decreasing stepsizes, RAPSA converges almost surely over the random choice of blocks and functions. (ii) When using constant stepsizes, convergence is to a neighborhood of optimality with a rate that is linear in expectation. RAPSA is numerically evaluated on the MNIST digit recognition problem.

Submitted 22 March, 2016; originally announced March 2016.

arXiv:1603.04954 [pdf, other]

Online Optimization in Dynamic Environments: Improved Regret Rates for Strongly Convex Problems

Authors: Aryan Mokhtari, Shahin Shahrampour, Ali Jadbabaie, Alejandro Ribeiro

Abstract: In this paper, we address tracking of a time-varying parameter with unknown dynamics. We formalize the problem as an instance of online optimization in a dynamic setting. Using online gradient descent, we propose a method that sequentially predicts the value of the parameter and in turn suffers a loss. The objective is to minimize the accumulation of losses over the time horizon, a notion that is termed dynamic regret. While existing methods focus on convex loss functions, we consider strongly convex functions so as to provide better guarantees of performance. We derive a regret bound that captures the path-length of the time-varying parameter, defined in terms of the distance between its consecutive values. In other words, the bound represents the natural connection of tracking quality to the rate of change of the parameter. We provide numerical experiments to complement our theoretical findings.
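The predict-then-suffer loop and the two quantities named in this abstract, dynamic regret and path-length, can be made concrete with a small sketch. Everything specific here is an assumption for illustration rather than the paper's setup: the quadratic loss 0.5 * (x - theta_t)^2, the step size, and the helper names `online_gradient_descent` and `path_length`.

```python
import math


def online_gradient_descent(targets, step=0.5, x0=0.0):
    """Track a drifting scalar with online gradient descent.

    Assumed loss at time t: f_t(x) = 0.5 * (x - targets[t])**2, whose minimum
    value is 0, so the accumulated loss equals the dynamic regret.
    """
    x = x0
    regret = 0.0
    for theta in targets:
        regret += 0.5 * (x - theta) ** 2  # loss suffered by the current prediction
        x -= step * (x - theta)           # gradient step on the revealed loss
    return x, regret


def path_length(targets):
    """Sum of distances between consecutive values of the parameter."""
    return sum(abs(t1 - t0) for t0, t1 in zip(targets, targets[1:]))


# Hypothetical slowly drifting parameter.
targets = [math.sin(0.01 * t) for t in range(1000)]
final_x, regret = online_gradient_descent(targets)
drift = path_length(targets)
```

In this toy setting a slower drift (smaller `drift`) leaves the predictions closer to the parameter, which is the qualitative connection between tracking quality and rate of change that the abstract describes.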

Submitted 16 March, 2016; originally announced March 2016.

arXiv:1602.07245 [pdf, other]

doi 10.1007/JHEP07(2016)111

Weyl holographic superconductor in the Lifshitz black hole background

Authors: S. A. Hosseini Mansoori, B. Mirza, A. Mokhtari, F. Lalehgani Dezaki, Z. Sherkatghanad

Abstract: We investigate analytically the properties of the Weyl holographic superconductor in the Lifshitz black hole background. We find that the critical temperature of the Weyl superconductor decreases with increasing Lifshitz dynamical exponent, $z$, indicating that condensation becomes difficult. In addition, it is found that the critical temperature and condensation operator could be affected by applying the Weyl coupling, $γ$. Moreover, we compute the critical magnetic field and investigate its dependence on the parameters $γ$ and $z$. Finally, we show numerically that the Weyl coupling parameter $γ$ and the Lifshitz dynamical exponent $z$ together control the size and strength of the conductivity peak and the ratio of gap frequency over critical temperature $ω_{g}/T_{c}$.

Submitted 21 July, 2016; v1 submitted 23 February, 2016; originally announced February 2016.

Comments: 25 pages, 22 figures

Journal ref: JHEP07(2016)111

arXiv:1602.01716 [pdf, other]

Decentralized Prediction-Correction Methods for Networked Time-Varying Convex Optimization

Authors: Andrea Simonetto, Alec Koppel, Aryan Mokhtari, Geert Leus, Alejandro Ribeiro

Abstract: We develop algorithms that find and track the optimal solution trajectory of time-varying convex optimization problems which consist of local and network-related objectives. The algorithms are derived from the prediction-correction methodology, which corresponds to a strategy where the time-varying problem is sampled at discrete time instances and then a sequence is generated via alternately executing predictions on how the optimizers at the next time sample are changing and corrections on how they actually have changed. Prediction is based on how the optimality conditions evolve in time, while correction is based on a gradient or Newton method, leading to Decentralized Prediction-Correction Gradient (DPC-G) and Decentralized Prediction-Correction Newton (DPC-N). We extend these methods to cases where the knowledge on how the optimization programs are changing in time is only approximate and propose Decentralized Approximate Prediction-Correction Gradient (DAPC-G) and Decentralized Approximate Prediction-Correction Newton (DAPC-N). Convergence properties of all the proposed methods are studied and empirical performance is shown on an application of a resource allocation problem in a wireless network. We observe that the proposed methods outperform existing running algorithms by orders of magnitude. The numerical results showcase a trade-off between convergence accuracy, sampling period, and network communications.

Submitted 7 November, 2016; v1 submitted 4 February, 2016; originally announced February 2016.

arXiv:1602.00596 [pdf, other]

A Decentralized Second-Order Method with Exact Linear Convergence Rate for Consensus Optimization

Authors: Aryan Mokhtari, Wei Shi, Qing Ling, Alejandro Ribeiro

Abstract: This paper considers decentralized consensus optimization problems where different summands of a global objective function are available at nodes of a network that can communicate with neighbors only. The proximal method of multipliers is considered as a powerful tool that relies on proximal primal descent and dual ascent updates on a suitably defined augmented Lagrangian. The structure of the augmented Lagrangian makes this problem non-decomposable, which precludes distributed implementations. This problem is regularly addressed by the use of the alternating direction method of multipliers. The exact second order method (ESOM) is introduced here as an alternative that relies on: (i) The use of a separable quadratic approximation of the augmented Lagrangian. (ii) A truncated Taylor's series to estimate the solution of the first order condition imposed on the minimization of the quadratic approximation of the augmented Lagrangian. The sequences of primal and dual variables generated by ESOM are shown to converge linearly to their optimal arguments when the aggregate cost function is strongly convex and its gradients are Lipschitz continuous. Numerical results demonstrate advantages of ESOM relative to decentralized alternatives in solving least squares and logistic regression problems.

Submitted 1 February, 2016; originally announced February 2016.

arXiv:1510.07356 [pdf, other]

Decentralized Quadratically Approximated Alternating Direction Method of Multipliers

Authors: Aryan Mokhtari, Wei Shi, Qing Ling, Alejandro Ribeiro

Abstract: This paper considers an optimization problem in which components of the objective function are available at different nodes of a network and nodes are allowed to exchange information only with their neighbors. The decentralized alternating method of multipliers (DADMM) is a well-established iterative method for solving this category of problems; however, implementation of DADMM requires solving an optimization subproblem at each iteration for each node. This procedure is often computationally costly for the nodes. We introduce a decentralized quadratic approximation of ADMM (DQM) that reduces the computational complexity of DADMM by minimizing a quadratic approximation of the objective function. Notwithstanding that DQM successively minimizes approximations of the cost, it converges to the optimal arguments at a linear rate which is identical to the convergence rate of DADMM. Further, we show that as time passes the coefficient of linear convergence for DQM approaches the one for DADMM. Numerical results demonstrate the effectiveness of DQM.

Submitted 25 October, 2015; originally announced October 2015.

Comments: arXiv admin note: substantial text overlap with arXiv:1508.02073

arXiv:1509.05196 [pdf, other]

doi 10.1109/TSP.2016.2568161

A Class of Prediction-Correction Methods for Time-Varying Convex Optimization

Authors: Andrea Simonetto, Aryan Mokhtari, Alec Koppel, Geert Leus, Alejandro Ribeiro

Abstract: This paper considers unconstrained convex optimization problems with time-varying objective functions. We propose algorithms with a discrete time-sampling scheme to find and track the solution trajectory based on prediction and correction steps, while sampling the problem data at a constant rate of $1/h$, where $h$ is the length of the sampling interval. The prediction step is derived by analyzing the iso-residual dynamics of the optimality conditions. The correction step adjusts for the distance between the current prediction and the optimizer at each time step, and consists either of one or multiple gradient steps or Newton steps, which respectively correspond to the gradient trajectory tracking (GTT) or Newton trajectory tracking (NTT) algorithms. Under suitable conditions, we establish that the asymptotic error incurred by both proposed methods behaves as $O(h^2)$, and in some cases as $O(h^4)$, which outperforms the state-of-the-art error bound of $O(h)$ for correction-only methods in the gradient-correction step. Moreover, when the characteristics of the objective function variation are not available, we propose approximate gradient and Newton tracking algorithms (AGT and ANT, respectively) that still attain these asymptotical error bounds. Numerical simulations demonstrate the practical utility of the proposed methods and that they improve upon existing techniques by several orders of magnitude.
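The gap between correction-only tracking and prediction-correction tracking claimed in this abstract can be seen on a toy problem. The sketch below is a simplification under stated assumptions, not the GTT or NTT algorithms themselves: the objective f(x; t) = 0.5 * (x - sin(t))^2, the use of the exact time derivative in the prediction step, and the helper name `track` are all chosen here for illustration.

```python
import math


def track(h=0.1, gamma=0.5, steps=200, predict=True):
    """Track argmin_x 0.5 * (x - sin(t))**2 on the samples t_k = k * h.

    Returns the worst tracking error over the second half of the run.
    Hypothetical helper; the objective and parameters are assumptions.
    """
    x = 0.0
    worst = 0.0
    for k in range(steps):
        t = k * h
        if predict:
            # Prediction: x - h * (d2f/dx2)^(-1) * d2f/dtdx, where for this
            # objective d2f/dx2 = 1 and d2f/dtdx = -cos(t).
            x = x + h * math.cos(t)
        t_next = t + h
        # Correction: one gradient step on the newly sampled objective.
        x = x - gamma * (x - math.sin(t_next))
        if k >= steps // 2:
            worst = max(worst, abs(x - math.sin(t_next)))
    return worst


error_with_prediction = track(predict=True)
error_correction_only = track(predict=False)
```

On this example the correction-only error scales with the sampling interval `h`, while adding the prediction step leaves an error driven by the second-order term in `h`, so `error_with_prediction` comes out well below `error_correction_only`.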

Submitted 11 May, 2016; v1 submitted 17 September, 2015; originally announced September 2015.

Comments: 16 pages, 8 figures

Journal ref: IEEE Transactions on Signal Processing, vol. 64 (17), pages 4576 - 4591, 2016

arXiv:1508.02073 [pdf, other]

doi 10.1109/TSP.2016.2548989

DQM: Decentralized Quadratically Approximated Alternating Direction Method of Multipliers

Authors: Aryan Mokhtari, Wei Shi, Qing Ling, Alejandro Ribeiro

Abstract: This paper considers decentralized consensus optimization problems where nodes of a network have access to different summands of a global objective function. Nodes cooperate to minimize the global objective by exchanging information with neighbors only. A decentralized version of the alternating directions method of multipliers (DADMM) is a common method for solving this category of problems. DADMM exhibits linear convergence rate to the optimal objective but its implementation requires solving a convex optimization problem at each iteration. This can be computationally costly and may result in large overall convergence times. The decentralized quadratically approximated ADMM algorithm (DQM), which minimizes a quadratic approximation of the objective function that DADMM minimizes at each iteration, is proposed here. The consequent reduction in computational time is shown to have minimal effect on convergence properties. Convergence still proceeds at a linear rate with a guaranteed constant that is asymptotically equivalent to the DADMM linear convergence rate constant. Numerical results demonstrate advantages of DQM relative to DADMM and other alternatives in a logistic regression problem.

Submitted 9 August, 2015; originally announced August 2015.

Comments: 13 pages

arXiv:1506.04216 [pdf, ps, other]

DSA: Decentralized Double Stochastic Averaging Gradient Algorithm

Authors: Aryan Mokhtari, Alejandro Ribeiro

Abstract: This paper considers convex optimization problems where nodes of a network have access to summands of a global objective. Each of these local objectives is further assumed to be an average of a finite set of functions. The motivation for this setup is to solve large scale machine learning problems where elements of the training set are distributed to multiple computational elements. The decentralized double stochastic averaging gradient (DSA) algorithm is proposed as a solution alternative that relies on: (i) The use of local stochastic averaging gradients. (ii) Determination of descent steps as differences of consecutive stochastic averaging gradients. Strong convexity of local functions and Lipschitz continuity of local gradients are shown to guarantee linear convergence of the sequence generated by DSA in expectation. Local iterates are further shown to approach the optimal argument for almost all realizations. The expected linear convergence of DSA is in contrast to the sublinear rate characteristic of existing methods for decentralized stochastic optimization. Numerical experiments on a logistic regression problem illustrate reductions in convergence time and number of feature vectors processed until convergence relative to these other alternatives.

Submitted 12 June, 2015; originally announced June 2015.

arXiv:1505.02344 [pdf, ps, other]

More on Lie Derivations of Generalized Matrix Algebras

Authors: A. H. Mokhtari, H. R. Ebrahimi Vishki

Abstract: Motivated by Cheung's elaborate work [Linear Multilinear Algebra, 51 (2003), 299-310], we investigate the construction of a Lie derivation on a generalized matrix algebra and apply it to give a characterization for such a Lie derivation to be proper. Our approach not only provides a direct proof for some known results in the theory, but also presents several sufficient conditions assuring the properness of Lie derivations on certain generalized matrix algebras.

Submitted 10 May, 2015; originally announced May 2015.

Comments: 11 pages

MSC Class: 16W25; 15A78; 47B47

Showing 51–100 of 113 results for author: Mokhtari, A