-
Tree Search for Simultaneous Move Games via Equilibrium Approximation
Authors:
Ryan Yu,
Alex Olshevsky,
Peter Chin
Abstract:
Neural network supported tree-search has shown strong results in a variety of perfect information multi-agent tasks. However, the performance of these methods on partial information games has generally been below competing approaches. Here we study the class of simultaneous-move games, the subclass of partial information games most similar to perfect information games: both agents know the game state with the exception of the opponent's move, which is revealed only after each agent makes its own move. Simultaneous move games include popular benchmarks such as Google Research Football and StarCraft.
In this study we answer the question: can we take tree search algorithms trained through self-play from perfect information settings and adapt them to simultaneous move games without significant loss of performance? We answer this question by deriving a practical method that attempts to approximate a coarse correlated equilibrium as a subroutine within a tree search. Our algorithm works on cooperative, competitive, and mixed tasks. Our results are better than the current best MARL algorithms on a wide range of accepted baseline environments.
Submitted 14 June, 2024;
originally announced June 2024.
-
On Value Iteration Convergence in Connected MDPs
Authors:
Arsenii Mustafin,
Alex Olshevsky,
Ioannis Ch. Paschalidis
Abstract:
This paper establishes that an MDP with a unique optimal policy and ergodic associated transition matrix ensures the convergence of various versions of the Value Iteration algorithm at a geometric rate that exceeds the discount factor γ for both discounted and average-reward criteria.
Submitted 13 June, 2024;
originally announced June 2024.
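The abstract above concerns the convergence rate of Value Iteration. As a point of reference, here is a minimal sketch of the textbook discounted Value Iteration update, not the specific variants analyzed in the paper. The two-state, two-action MDP (`P`, `R`, `gamma`) is a made-up example, and the sketch only records the sup-norm change between successive iterates so that a geometric decay can be inspected.

```python
def value_iteration(P, R, gamma, tol=1e-10, max_iter=10_000):
    """Standard discounted value iteration.

    P[a][s][s2] is the probability of moving from s to s2 under action a;
    R[s][a] is the immediate reward. Both are hypothetical inputs.
    """
    n_states = len(R)
    n_actions = len(R[0])
    V = [0.0] * n_states
    errors = []  # sup-norm change between successive iterates
    for _ in range(max_iter):
        # Bellman optimality backup for every state.
        V_new = [
            max(
                R[s][a] + gamma * sum(P[a][s][s2] * V[s2] for s2 in range(n_states))
                for a in range(n_actions)
            )
            for s in range(n_states)
        ]
        err = max(abs(v_new - v_old) for v_new, v_old in zip(V_new, V))
        errors.append(err)
        V = V_new
        if err < tol:
            break
    return V, errors


# Hypothetical MDP used only for illustration.
P = [
    [[0.9, 0.1], [0.2, 0.8]],  # transitions under action 0
    [[0.5, 0.5], [0.6, 0.4]],  # transitions under action 1
]
R = [[1.0, 0.0], [0.0, 2.0]]
gamma = 0.9

V, errors = value_iteration(P, R, gamma)
```

Because the Bellman operator is a contraction with modulus `gamma` in the sup norm, each entry of `errors` is at most `gamma` times the previous one; whether the observed decay is strictly faster depends on the MDP.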
-
Sample Complexity of the Linear Quadratic Regulator: A Reinforcement Learning Lens
Authors:
Amirreza Neshaei Moghaddam,
Alex Olshevsky,
Bahman Gharesifard
Abstract:
We provide the first known algorithm that provably achieves $\varepsilon$-optimality within $\widetilde{\mathcal{O}}(1/\varepsilon)$ function evaluations for the discounted discrete-time LQR problem with unknown parameters, without relying on two-point gradient estimates. These estimates are known to be unrealistic in many settings, as they depend on using the exact same initialization, which is to be selected randomly, for two different policies. Our results substantially improve upon the existing literature outside the realm of two-point gradient estimates, which either leads to $\widetilde{\mathcal{O}}(1/\varepsilon^2)$ rates or heavily relies on stability assumptions.
Submitted 18 April, 2024; v1 submitted 16 April, 2024;
originally announced April 2024.
-
One-Shot Averaging for Distributed TD($λ$) Under Markov Sampling
Authors:
Haoxing Tian,
Ioannis Ch. Paschalidis,
Alex Olshevsky
Abstract:
We consider a distributed setup for reinforcement learning, where each agent has a copy of the same Markov Decision Process but transitions are sampled from the corresponding Markov chain independently by each agent. We show that in this setting, we can achieve a linear speedup for TD($λ$), a family of popular methods for policy evaluation, in the sense that $N$ agents can evaluate a policy $N$ times faster provided the target accuracy is small enough. Notably, this speedup is achieved by ``one shot averaging,'' a procedure where the agents run TD($λ$) with Markov sampling independently and only average their results after the final step. This significantly reduces the amount of communication required to achieve a linear speedup relative to previous work.
Submitted 31 May, 2024; v1 submitted 13 March, 2024;
originally announced March 2024.
-
Convex SGD: Generalization Without Early Stopping
Authors:
Julien Hendrickx,
Alex Olshevsky
Abstract:
We consider the generalization error associated with stochastic gradient descent on a smooth convex function over a compact set. We show the first bound on the generalization error that vanishes when the number of iterations $T$ and the dataset size $n$ go to infinity at arbitrary rates; our bound scales as $\tilde{O}(1/\sqrt{T} + 1/\sqrt{n})$ with step-size $α_t = 1/\sqrt{t}$. In particular, strong convexity is not needed for stochastic gradient descent to generalize well.
Submitted 14 April, 2024; v1 submitted 8 January, 2024;
originally announced January 2024.
-
On the Performance of Temporal Difference Learning With Neural Networks
Authors:
Haoxing Tian,
Ioannis Ch. Paschalidis,
Alex Olshevsky
Abstract:
Neural Temporal Difference (TD) Learning is an approximate temporal difference method for policy evaluation that uses a neural network for function approximation. Analysis of Neural TD Learning has proven to be challenging. In this paper we provide a convergence analysis of Neural TD Learning with a projection onto $B(θ_0, ω)$, a ball of fixed radius $ω$ around the initial point $θ_0$. We show an approximation bound of $O(ε) + \tilde{O} (1/\sqrt{m})$ where $ε$ is the approximation quality of the best neural network in $B(θ_0, ω)$ and $m$ is the width of all hidden layers in the network.
Submitted 8 December, 2023;
originally announced December 2023.
-
Distributed TD(0) with Almost No Communication
Authors:
Rui Liu,
Alex Olshevsky
Abstract:
We provide a new non-asymptotic analysis of distributed temporal difference learning with linear function approximation. Our approach relies on ``one-shot averaging,'' where $N$ agents run identical local copies of the TD(0) method and average the outcomes only once at the very end. We demonstrate a version of the linear time speedup phenomenon, where the convergence time of the distributed process is a factor of $N$ faster than the convergence time of TD(0). This is the first result proving benefits from parallelism for temporal difference methods.
Submitted 25 May, 2023;
originally announced May 2023.
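The "one-shot averaging" procedure described in the abstract above can be sketched as follows. This is an illustrative toy, not the paper's setup: it uses tabular TD(0) (linear function approximation with one-hot features), a made-up two-state Markov chain `P` with rewards `r`, and an assumed step-size of `1/sqrt(t)`. Each agent runs independently on its own trajectory, and the only communication is a single average at the end.

```python
import random


def td0_run(P, r, gamma, steps, seed):
    """One agent's tabular TD(0) run along a single Markov trajectory."""
    rng = random.Random(seed)
    n = len(r)
    theta = [0.0] * n  # value estimate per state (one-hot features)
    s = rng.randrange(n)
    for t in range(1, steps + 1):
        s_next = rng.choices(range(n), weights=P[s])[0]
        alpha = 1.0 / (t ** 0.5)  # assumed step-size schedule
        td_error = r[s] + gamma * theta[s_next] - theta[s]
        theta[s] += alpha * td_error
        s = s_next
    return theta


def one_shot_average(P, r, gamma, steps, n_agents):
    """Run n_agents independent TD(0) copies and average only at the end."""
    runs = [td0_run(P, r, gamma, steps, seed=i) for i in range(n_agents)]
    avg = [sum(run[k] for run in runs) / n_agents for k in range(len(r))]
    return avg, runs


# Hypothetical chain and rewards, chosen only for illustration.
P = [[0.9, 0.1], [0.5, 0.5]]
r = [1.0, 0.0]
gamma = 0.5

avg, runs = one_shot_average(P, r, gamma, steps=5000, n_agents=8)
```

For this toy chain the true value function solves V = r + γPV, which gives V = (1.875, 0.625); the averaged estimate can be compared against it. The sketch makes no attempt to measure the speedup factor stated in the abstract.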
-
Closing the gap between SVRG and TD-SVRG with Gradient Splitting
Authors:
Arsenii Mustafin,
Alex Olshevsky,
Ioannis Ch. Paschalidis
Abstract:
Temporal difference (TD) learning is a policy evaluation method in reinforcement learning whose performance can be enhanced by variance reduction methods. Recently, multiple works have sought to fuse TD learning with the Stochastic Variance Reduced Gradient (SVRG) method to achieve a geometric rate of convergence. However, the resulting convergence rate is significantly weaker than what is achieved by SVRG in the setting of convex optimization. In this work we utilize a recent interpretation of TD-learning as the splitting of the gradient of an appropriately chosen function, thus simplifying the algorithm and fusing TD with SVRG. Our main result is a geometric convergence bound with a predetermined learning rate of $1/8$, which is identical to the convergence bound available for SVRG in the convex setting. Our theoretical findings are supported by a set of experiments.
Submitted 20 March, 2024; v1 submitted 29 November, 2022;
originally announced November 2022.
-
A Small Gain Analysis of Single Timescale Actor Critic
Authors:
Alex Olshevsky,
Bahman Gharesifard
Abstract:
We consider a version of actor-critic which uses proportional step-sizes and only one critic update with a single sample from the stationary distribution per actor step. We provide an analysis of this method using the small-gain theorem. Specifically, we prove that this method can be used to find a stationary point, and that the resulting sample complexity improves the state of the art for actor-critic methods to $O \left(μ^{-2} ε^{-2} \right)$ to find an $ε$-approximate stationary point where $μ$ is the condition number associated with the critic.
Submitted 25 May, 2023; v1 submitted 4 March, 2022;
originally announced March 2022.
-
Communication-efficient SGD: From Local SGD to One-Shot Averaging
Authors:
Artin Spiridonoff,
Alex Olshevsky,
Ioannis Ch. Paschalidis
Abstract:
We consider speeding up stochastic gradient descent (SGD) by parallelizing it across multiple workers. We assume the same data set is shared among $N$ workers, who can take SGD steps and coordinate with a central server. While it is possible to obtain a linear reduction in the variance by averaging all the stochastic gradients at every step, this requires a lot of communication between the workers and the server, which can dramatically reduce the gains from parallelism. The Local SGD method, proposed and analyzed in the earlier literature, suggests machines should make many local steps between such communications. While the initial analysis of Local SGD showed it needs $Ω( \sqrt{T} )$ communications for $T$ local gradient steps in order for the error to scale proportionately to $1/(NT)$, this has been successively improved in a string of papers, with the state of the art requiring $Ω\left( N \left( \mbox{ poly} (\log T) \right) \right)$ communications. In this paper, we suggest a Local SGD scheme that communicates less overall by communicating less frequently as the number of iterations grows. Our analysis shows that this can achieve an error that scales as $1/(NT)$ with a number of communications that is completely independent of $T$. In particular, we show that $Ω(N)$ communications are sufficient. Empirical evidence suggests this bound is close to tight as we further show that $\sqrt{N}$ or $N^{3/4}$ communications fail to achieve linear speed-up in simulations. Moreover, we show that under mild assumptions, the main of which is twice differentiability on any neighborhood of the optimal solution, one-shot averaging which only uses a single round of communication can also achieve the optimal convergence rate asymptotically.
Submitted 27 October, 2021; v1 submitted 8 June, 2021;
originally announced June 2021.
-
Distributed TD(0) with Almost No Communication
Authors:
Rui Liu,
Alex Olshevsky
Abstract:
We provide a new non-asymptotic analysis of distributed TD(0) with linear function approximation. Our approach relies on "one-shot averaging," where $N$ agents run local copies of TD(0) and average the outcomes only once at the very end. We consider two models: one in which the agents interact with an environment they can observe and whose transitions depend on all of their actions (which we call the global state model), and one in which each agent can run a local copy of an identical Markov Decision Process, which we call the local state model.
In the global state model, we show that the convergence rate of our distributed one-shot averaging method matches the known convergence rate of TD(0). By contrast, the best convergence rate in the previous literature showed a rate which, according to the worst-case bounds given, could underperform the non-distributed version by $O(N^3)$ in terms of the number of agents $N$. In the local state model, we demonstrate a version of the linear time speedup phenomenon, where the convergence time of the distributed process is a factor of $N$ faster than the convergence time of TD(0). As far as we are aware, this is the first result rigorously showing benefits from parallelism for temporal difference methods.
Submitted 27 January, 2022; v1 submitted 15 April, 2021;
originally announced April 2021.
-
Temporal Difference Learning as Gradient Splitting
Authors:
Rui Liu,
Alex Olshevsky
Abstract:
Temporal difference learning with linear function approximation is a popular method to obtain a low-dimensional approximation of the value function of a policy in a Markov Decision Process. We give a new interpretation of this method in terms of a splitting of the gradient of an appropriately chosen function. As a consequence of this interpretation, convergence proofs for gradient descent can be applied almost verbatim to temporal difference learning. Beyond giving a new, fuller explanation of why temporal difference works, our interpretation also yields improved convergence times. We consider the setting with $1/\sqrt{T}$ step-size, where previous comparable finite-time convergence time bounds for temporal difference learning had the multiplicative factor $1/(1-γ)$ in front of the bound, with $γ$ being the discount factor. We show that a minor variation on TD learning which estimates the mean of the value function separately has a convergence time where $1/(1-γ)$ only multiplies an asymptotically negligible term.
Submitted 27 October, 2020;
originally announced October 2020.
-
Adversarial Crowdsourcing Through Robust Rank-One Matrix Completion
Authors:
Qianqian Ma,
Alex Olshevsky
Abstract:
We consider the problem of reconstructing a rank-one matrix from a revealed subset of its entries when some of the revealed entries are corrupted with perturbations that are unknown and can be arbitrarily large. It is not known which revealed entries are corrupted. We propose a new algorithm combining alternating minimization with extreme-value filtering and provide sufficient and necessary conditions to recover the original rank-one matrix. In particular, we show that our proposed algorithm is optimal when the set of revealed entries is given by an Erdős-Rényi random graph. These results are then applied to the problem of classification from crowdsourced data under the assumption that while the majority of the workers are governed by the standard single-coin Dawid-Skene model (i.e., they output the correct answer with a certain probability), some of the workers can deviate arbitrarily from this model. In particular, the "adversarial" workers could even make decisions designed to make the algorithm output an incorrect answer. Extensive experimental results show our algorithm for this problem, based on rank-one matrix completion with perturbations, outperforms all other state-of-the-art methods in such an adversarial scenario.
Submitted 23 October, 2020;
originally announced October 2020.
-
Asymptotic Convergence Rate of Alternating Minimization for Rank One Matrix Completion
Authors:
Rui Liu,
Alex Olshevsky
Abstract:
We study alternating minimization for matrix completion in the simplest possible setting: completing a rank-one matrix from a revealed subset of the entries. We bound the asymptotic convergence rate by the variational characterization of eigenvalues of a reversible consensus problem. This leads to a polynomial upper bound on the asymptotic rate in terms of number of nodes as well as the largest degree of the graph of revealed entries.
Submitted 11 August, 2020;
originally announced August 2020.
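Alternating minimization for rank-one completion, the setting of the abstract above, can be sketched as follows. This is a generic least-squares formulation written for illustration and may differ in detail from the variant the paper analyzes. The function name, the all-ones initialization, and the 3×3 example with six revealed entries are assumptions.

```python
def alternating_minimization(observed, n_rows, n_cols, iters=500):
    """Fit M[i][j] ~ x[i] * y[j] on the revealed entries only.

    observed maps (i, j) -> revealed value. Every row and column is assumed
    to contain at least one revealed entry, so the denominators are nonzero.
    """
    x = [1.0] * n_rows
    y = [1.0] * n_cols
    for _ in range(iters):
        # Fix y and solve the one-dimensional least-squares problem per row.
        for i in range(n_rows):
            num = sum(v * y[j] for (a, j), v in observed.items() if a == i)
            den = sum(y[j] ** 2 for (a, j), _v in observed.items() if a == i)
            x[i] = num / den
        # Fix x and do the same per column.
        for j in range(n_cols):
            num = sum(v * x[i] for (i, b), v in observed.items() if b == j)
            den = sum(x[i] ** 2 for (i, b), _v in observed.items() if b == j)
            y[j] = num / den
    return x, y


# Hypothetical ground truth M = x_true * y_true^T with six revealed entries
# whose bipartite graph of revealed entries is connected.
x_true = [1.0, 2.0, 3.0]
y_true = [1.0, 1.0, 2.0]
revealed = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2), (2, 0)]
observed = {(i, j): x_true[i] * y_true[j] for (i, j) in revealed}

x, y = alternating_minimization(observed, n_rows=3, n_cols=3)
```

The factors `x` and `y` are only determined up to a common scaling, so the recovered product `x[i] * y[j]` is the quantity to compare against the true matrix, including at unrevealed positions.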
-
Local SGD With a Communication Overhead Depending Only on the Number of Workers
Authors:
Artin Spiridonoff,
Alex Olshevsky,
Ioannis Ch. Paschalidis
Abstract:
We consider speeding up stochastic gradient descent (SGD) by parallelizing it across multiple workers. We assume the same data set is shared among $n$ workers, who can take SGD steps and coordinate with a central server. Unfortunately, this could require a lot of communication between the workers and the server, which can dramatically reduce the gains from parallelism. The Local SGD method, proposed and analyzed in the earlier literature, suggests machines should make many local steps between such communications. While the initial analysis of Local SGD showed it needs $Ω( \sqrt{T} )$ communications for $T$ local gradient steps in order for the error to scale proportionately to $1/(nT)$, this has been successively improved in a string of papers, with the state-of-the-art requiring $Ω\left( n \left( \mbox{ polynomial in log } (T) \right) \right)$ communications. In this paper, we give a new analysis of Local SGD. A consequence of our analysis is that Local SGD can achieve an error that scales as $1/(nT)$ with only a fixed number of communications independent of $T$: specifically, only $Ω(n)$ communications are required.
Submitted 3 June, 2020;
originally announced June 2020.
-
Asymptotic Network Independence in Distributed Stochastic Optimization for Machine Learning
Authors:
Shi Pu,
Alex Olshevsky,
Ioannis Ch. Paschalidis
Abstract:
We provide a discussion of several recent results which, in certain scenarios, are able to overcome a barrier in distributed stochastic optimization for machine learning. Our focus is the so-called asymptotic network independence property, which is achieved whenever a distributed method executed over a network of n nodes asymptotically converges to the optimal solution at a comparable rate to a centralized method with the same computational power as the entire network. We explain this property through an example involving the training of ML models and sketch a short mathematical analysis for comparing the performance of distributed stochastic gradient descent (DSGD) with centralized stochastic gradient descent (SGD).
Submitted 18 February, 2020; v1 submitted 28 June, 2019;
originally announced June 2019.
-
A Sharp Estimate on the Transient Time of Distributed Stochastic Gradient Descent
Authors:
Shi Pu,
Alex Olshevsky,
Ioannis Ch. Paschalidis
Abstract:
This paper is concerned with minimizing the average of $n$ cost functions over a network in which agents may communicate and exchange information with each other. We consider the setting where only noisy gradient information is available. To solve the problem, we study the distributed stochastic gradient descent (DSGD) method and perform a non-asymptotic convergence analysis. For strongly convex and smooth objective functions, DSGD asymptotically achieves the optimal network independent convergence rate compared to centralized stochastic gradient descent (SGD). Our main contribution is to characterize the transient time needed for DSGD to approach the asymptotic convergence rate, which we show behaves as $K_T=\mathcal{O}\left(\frac{n}{(1-ρ_w)^2}\right)$, where $1-ρ_w$ denotes the spectral gap of the mixing matrix. Moreover, we construct a "hard" optimization problem for which we show the transient time needed for DSGD to approach the asymptotic convergence rate is lower bounded by $Ω\left(\frac{n}{(1-ρ_w)^2} \right)$, implying the sharpness of the obtained result. Numerical experiments demonstrate the tightness of the theoretical results.
Submitted 29 January, 2021; v1 submitted 6 June, 2019;
originally announced June 2019.
-
Gradient Descent for Sparse Rank-One Matrix Completion for Crowd-Sourced Aggregation of Sparsely Interacting Workers
Authors:
Yao Ma,
Alex Olshevsky,
Venkatesh Saligrama,
Csaba Szepesvari
Abstract:
We consider worker skill estimation for the single-coin Dawid-Skene crowdsourcing model. In practice, skill-estimation is challenging because worker assignments are sparse and irregular due to the arbitrary and uncontrolled availability of workers. We formulate skill estimation as a rank-one correlation-matrix completion problem, where the observed components correspond to observed label correlations between workers. We show that the correlation matrix can be successfully recovered and skills are identifiable if and only if the sampling matrix (observed components) does not have a bipartite connected component. We then propose a projected gradient descent scheme and show that skill estimates converge to the desired global optima for such sampling matrices. Our proof is original and the results are surprising in light of the fact that even the weighted rank-one matrix factorization problem is NP-hard in general. Next, we derive sample complexity bounds in terms of spectral properties of the signless Laplacian of the sampling matrix. Our proposed scheme achieves state-of-the-art performance on a number of real-world datasets.
Submitted 23 July, 2020; v1 submitted 25 April, 2019;
originally announced April 2019.
-
Graph Resistance and Learning from Pairwise Comparisons
Authors:
Julien M. Hendrickx,
Alex Olshevsky,
Venkatesh Saligrama
Abstract:
We consider the problem of learning the qualities of a collection of items by performing noisy comparisons among them. Following the standard paradigm, we assume there is a fixed "comparison graph" and every neighboring pair of items in this graph is compared $k$ times according to the Bradley-Terry-Luce model (where the probability that an item wins a comparison is proportional to the item quality). We are interested in how the relative error in quality estimation scales with the comparison graph in the regime where $k$ is large. We prove that, after a known transition period, the relevant graph-theoretic quantity is the square root of the resistance of the comparison graph. Specifically, we provide an algorithm that is minimax optimal. The algorithm has a relative error decay that scales with the square root of the graph resistance, and we provide a matching lower bound (up to log factors). The performance guarantee of our algorithm, both in terms of the graph and the skewness of the item quality distribution, outperforms earlier results.
Submitted 12 June, 2019; v1 submitted 31 January, 2019;
originally announced February 2019.
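The Bradley-Terry-Luce comparison model mentioned in the abstract above can be illustrated with a short sketch. Under the standard form of the model, item i with quality w_i beats item j with probability w_i / (w_i + w_j). The function names, the quality values, and the seeded simulation of k comparisons are assumptions made for illustration; the sketch does not implement the paper's estimation algorithm.

```python
import random


def btl_win_probability(w_i, w_j):
    """Probability that item i beats item j under Bradley-Terry-Luce."""
    return w_i / (w_i + w_j)


def simulate_comparisons(w_i, w_j, k, seed=0):
    """Simulate k independent comparisons; return how many item i wins."""
    rng = random.Random(seed)
    p = btl_win_probability(w_i, w_j)
    return sum(rng.random() < p for _ in range(k))


# Hypothetical qualities for one neighboring pair in the comparison graph.
k = 1000
wins = simulate_comparisons(3.0, 1.0, k)
empirical_win_rate = wins / k
```

With qualities 3 and 1 the model win probability is 0.75, and the empirical win rate from `k` comparisons is the raw statistic from which a quality ratio could be estimated.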
-
Graph-Theoretic Analysis of Belief System Dynamics under Logic Constraints
Authors:
Angelia Nedić,
Alex Olshevsky,
César A. Uribe
Abstract:
Opinion formation cannot be modeled solely as an ideological deduction from a set of principles; rather, repeated social interactions and logic constraints among statements are consequential in the construct of belief systems. We address three basic questions in the analysis of social opinion dynamics: (i) Will a belief system converge? (ii) How long does it take to converge? (iii) Where does it converge? We provide graph-theoretic answers to these questions for a model of opinion dynamics of a belief system with logic constraints. Our results make plain the implicit dependence of the convergence properties of a belief system on the underlying social network and on the set of logic constraints that relate beliefs on different statements. Moreover, we provide an explicit analysis of a variety of commonly used large-scale network models.
Submitted 30 December, 2018; v1 submitted 4 October, 2018;
originally announced October 2018.
-
Network Topology and Communication-Computation Tradeoffs in Decentralized Optimization
Authors:
Angelia Nedić,
Alex Olshevsky,
Michael G. Rabbat
Abstract:
In decentralized optimization, nodes cooperate to minimize an overall objective function that is the sum (or average) of per-node private objective functions. Algorithms interleave local computations with communication among all or a subset of the nodes. Motivated by a variety of applications---distributed estimation in sensor networks, fitting models to massive data sets, and distributed control of multi-robot systems, to name a few---significant advances have been made towards the development of robust, practical algorithms with theoretical performance guarantees. This paper presents an overview of recent work in this area. In general, rates of convergence depend not only on the number of nodes involved and the desired level of accuracy, but also on the structure and nature of the network over which nodes communicate (e.g., whether links are directed or undirected, static or time-varying). We survey the state-of-the-art algorithms and their analyses tailored to these different scenarios, highlighting the role of the network topology.
Submitted 15 January, 2018; v1 submitted 25 September, 2017;
originally announced September 2017.
-
Crowdsourcing with Sparsely Interacting Workers
Authors:
Yao Ma,
Alex Olshevsky,
Venkatesh Saligrama,
Csaba Szepesvari
Abstract:
We consider estimation of worker skills from worker-task interaction data (with unknown labels) for the single-coin crowd-sourcing binary classification model in symmetric noise. We define the (worker) interaction graph whose nodes are workers and in which an edge between two nodes indicates that the two workers participated in a common task. We show that skills are asymptotically identifiable if and only if an appropriate limiting version of the interaction graph is irreducible and has odd cycles. We then formulate a weighted rank-one optimization problem to estimate skills based on observations on an irreducible, aperiodic interaction graph. We propose a gradient descent scheme and show that for such interaction graphs estimates converge asymptotically to the global minimum. We characterize noise robustness of the gradient scheme in terms of spectral properties of signless Laplacians of the interaction graph. We then demonstrate that a plug-in estimator based on the estimated skills achieves state-of-the-art performance on a number of real-world datasets. Our results have implications for the rank-one matrix completion problem in that gradient descent can provably recover $W \times W$ rank-one matrices based on $W+1$ off-diagonal observations of a connected graph with a single odd cycle.
Submitted 20 June, 2017;
originally announced June 2017.
-
Improved Convergence Rates for Distributed Resource Allocation
Authors:
Angelia Nedić,
Alex Olshevsky,
Wei Shi
Abstract:
In this paper, we develop a class of decentralized algorithms for solving a convex resource allocation problem in a network of $n$ agents, where the agent objectives are decoupled while the resource constraints are coupled. The agents communicate over a connected undirected graph, and they want to collaboratively determine a solution to the overall network problem, while each agent only communicates with its neighbors. We first study the connection between the decentralized resource allocation problem and the decentralized consensus optimization problem. Then, using a class of algorithms for solving consensus optimization problems, we propose a novel class of decentralized schemes for solving resource allocation problems in a distributed manner. Specifically, we first propose an algorithm for solving the resource allocation problem with an $o(1/k)$ convergence rate guarantee when the agents' objective functions are generally convex (possibly nondifferentiable) and per-agent local convex constraints are allowed; we then propose a gradient-based algorithm for solving the resource allocation problem when per-agent local constraints are absent and show that such a scheme can achieve a geometric rate when the objective functions are strongly convex and have Lipschitz continuous gradients. We also provide a scalability/network dependency analysis. Based on these two algorithms, we further propose a gradient projection-based algorithm which can handle smooth objectives and simple constraints more efficiently. Numerical experiments demonstrate the viability and performance of all the proposed algorithms.
Submitted 16 December, 2018; v1 submitted 16 June, 2017;
originally announced June 2017.
-
Distributed Learning for Cooperative Inference
Authors:
Angelia Nedić,
Alex Olshevsky,
César A. Uribe
Abstract:
We study the problem of cooperative inference where a group of agents interact over a network and seek to estimate a joint parameter that best explains a set of observations. Agents do not know the network topology or the observations of other agents. We explore a variational interpretation of the Bayesian posterior density, and its relation to the stochastic mirror descent algorithm, to propose a new distributed learning algorithm. We show that, under appropriate assumptions, the beliefs generated by the proposed algorithm concentrate around the true parameter exponentially fast. We provide explicit non-asymptotic bounds for the convergence rate. Moreover, we develop explicit and computationally efficient algorithms for observation models belonging to exponential families.
Submitted 10 April, 2017;
originally announced April 2017.
-
Distributed Gaussian Learning over Time-varying Directed Graphs
Authors:
Angelia Nedić,
Alex Olshevsky,
César A. Uribe
Abstract:
We present a distributed (non-Bayesian) learning algorithm for the problem of parameter estimation with Gaussian noise. The algorithm is expressed as explicit updates on the parameters of the Gaussian beliefs (i.e. means and precision). We show a convergence rate of $O(1/k)$ with the constant term depending on the number of agents and the topology of the network. Moreover, we show almost sure convergence to the optimal solution of the estimation problem for the general case of time-varying directed graphs.
Submitted 6 December, 2016; v1 submitted 5 December, 2016;
originally announced December 2016.
-
A Tutorial on Distributed (Non-Bayesian) Learning: Problem, Algorithms and Results
Authors:
Angelia Nedić,
Alex Olshevsky,
César A. Uribe
Abstract:
We overview some results on distributed learning with a focus on a family of recently proposed algorithms known as non-Bayesian social learning. We consider different approaches to the distributed learning problem and its algorithmic solutions for the case of finitely many hypotheses. The original centralized problem is discussed first, followed by a generalization to the distributed setting. The results on convergence and convergence rate are presented for both asymptotic and finite-time regimes. Various extensions are discussed, such as those dealing with directed time-varying networks, Nesterov's acceleration technique, and a continuum of hypotheses.
Submitted 23 September, 2016;
originally announced September 2016.
-
Distributed Learning with Infinitely Many Hypotheses
Authors:
Angelia Nedić,
Alex Olshevsky,
César Uribe
Abstract:
We consider a distributed learning setup where a network of agents sequentially access realizations of a set of random variables with unknown distributions. The network objective is to find a parametrized distribution that best describes their joint observations in the sense of the Kullback-Leibler divergence. Apart from recent efforts in the literature, we analyze the case of countably many hypotheses and the case of a continuum of hypotheses. We provide non-asymptotic bounds for the concentration rate of the agents' beliefs around the correct hypothesis in terms of the number of agents, the network parameters, and the learning abilities of the agents. Additionally, we provide a novel motivation for a general set of distributed Non-Bayesian update rules as instances of the distributed stochastic mirror descent algorithm.
Submitted 6 May, 2016;
originally announced May 2016.
-
Scaling laws for consensus protocols subject to noise
Authors:
Ali Jadbabaie,
Alex Olshevsky
Abstract:
We study the performance of discrete-time consensus protocols in the presence of additive noise. When the consensus dynamic corresponds to a reversible Markov chain, we give an exact expression for a weighted version of steady-state disagreement in terms of the stationary distribution and hitting times in an underlying graph. We then show how this result can be used to characterize the noise robustness of a class of protocols for formation control in terms of the Kemeny constant of an underlying graph.
Submitted 7 March, 2017; v1 submitted 31 July, 2015;
originally announced August 2015.
-
Convergence Time of Quantized Metropolis Consensus Over Time-Varying Networks
Authors:
Tamer Basar,
Seyed Rasoul Etesami,
Alex Olshevsky
Abstract:
We consider the quantized consensus problem on undirected time-varying connected graphs with n nodes, and devise a protocol with fast convergence time to the set of consensus points. Specifically, we show that when the edges of each network in a sequence of connected time-varying networks are activated based on Poisson processes with Metropolis rates, the expected convergence time to the set of consensus points is at most O(n^2 log^2 n), where each node performs a constant number of updates per unit time.
Submitted 2 February, 2016; v1 submitted 6 April, 2015;
originally announced April 2015.
-
Linear Time Average Consensus on Fixed Graphs and Implications for Decentralized Optimization and Multi-Agent Control
Authors:
Alex Olshevsky
Abstract:
We describe a protocol for the average consensus problem on any fixed undirected graph whose convergence time scales linearly in the total number of nodes $n$. The protocol is completely distributed, with the exception of requiring all nodes to know the same upper bound $U$ on the total number of nodes, which is correct within a constant multiplicative factor.
We next discuss applications of this protocol to problems in multi-agent control connected to the consensus problem. In particular, we describe protocols for formation maintenance and leader-following with convergence times which also scale linearly with the number of nodes.
Finally, we develop a distributed protocol for minimizing an average of (possibly nondifferentiable) convex functions $ (1/n) \sum_{i=1}^n f_i(θ)$, in the setting where only node $i$ in an undirected, connected graph knows the function $f_i(θ)$. Under the same assumption about all nodes knowing $U$, and additionally assuming that the subgradients of each $f_i(θ)$ have absolute values upper bounded by some constant $L$ known to the nodes, we show that after $T$ iterations our protocol has error which is $O(L \sqrt{n/T})$.
Submitted 3 August, 2017; v1 submitted 15 November, 2014;
originally announced November 2014.
-
On symmetric continuum opinion dynamics
Authors:
Julien M. Hendrickx,
Alex Olshevsky
Abstract:
This paper investigates the asymptotic behavior of some common opinion dynamic models in a continuum of agents. We show that as long as the interactions among the agents are symmetric, the distribution of the agents' opinion converges. We also investigate whether convergence occurs in a stronger sense than merely in distribution, namely, whether the opinion of almost every agent converges. We show that while this is not the case in general, it becomes true under plausible assumptions on inter-agent interactions, namely that agents with similar opinions exert a non-negligible pull on each other, or that the interactions are entirely determined by their opinions via a smooth function.
Submitted 10 August, 2016; v1 submitted 2 November, 2013;
originally announced November 2013.
-
On Primitivity of Sets of Matrices
Authors:
Vincent D. Blondel,
Raphael M. Jungers,
Alex Olshevsky
Abstract:
A nonnegative matrix $A$ is called primitive if $A^k$ is positive for some integer $k>0$. A generalization of this concept to finite sets of matrices is as follows: a set of matrices $\mathcal M = \{A_1, A_2, \ldots, A_m \}$ is primitive if $A_{i_1} A_{i_2} \ldots A_{i_k}$ is positive for some indices $i_1, i_2, ..., i_k$. The concept of primitive sets of matrices comes up in a number of problems within the study of discrete-time switched systems. In this paper, we analyze the computational complexity of deciding if a given set of matrices is primitive and we derive bounds on the length of the shortest positive product.
We show that while primitivity is algorithmically decidable, unless $P=NP$ it is not possible to decide primitivity of a matrix set in polynomial time. Moreover, we show that the length of the shortest positive sequence can be superpolynomial in the dimension of the matrices. On the other hand, defining ${\mathcal P}$ to be the set of matrices with no zero rows or columns, we give a simple combinatorial proof of a previously-known characterization of primitivity for matrices in ${\mathcal P}$ which can be tested in polynomial time. This latter observation is related to the well-known 1964 conjecture of Cerny on synchronizing automata; in fact, any bound on the minimal length of a synchronizing word for synchronizing automata immediately translates into a bound on the length of the shortest positive product of a primitive set of matrices in ${\mathcal P}$. In particular, any primitive set of $n \times n$ matrices in ${\mathcal P}$ has a positive product of length $O(n^3)$.
Submitted 15 April, 2015; v1 submitted 4 June, 2013;
originally announced June 2013.
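Illustrative note (not from the paper): the definition of a primitive set in the abstract above can be tried on small examples. Whether a product of nonnegative matrices is positive depends only on the zero/nonzero patterns of its factors, so a breadth-first search over patterns terminates, since there are at most 2^(n^2) of them. The Python sketch below follows that idea; the function names are hypothetical, and the search can take exponential time, which is consistent with the hardness result stated in the abstract.

```python
from collections import deque


def pattern(A):
    """Zero/nonzero pattern of a nonnegative matrix, as a hashable tuple."""
    return tuple(tuple(1 if x > 0 else 0 for x in row) for row in A)


def bool_mul(P, Q):
    """Boolean product of two square 0/1 patterns of the same size."""
    n = len(P)
    return tuple(
        tuple(1 if any(P[i][k] and Q[k][j] for k in range(n)) else 0
              for j in range(n))
        for i in range(n)
    )


def shortest_positive_product(mats):
    """Length of the shortest positive product of matrices from `mats`,
    or None if the set is not primitive (exhaustive search over patterns)."""
    gens = [pattern(A) for A in mats]
    seen = set(gens)
    queue = deque((g, 1) for g in gens)
    while queue:
        P, length = queue.popleft()
        if all(all(row) for row in P):
            return length
        for g in gens:
            Q = bool_mul(P, g)
            if Q not in seen:
                seen.add(Q)
                queue.append((Q, length + 1))
    return None


# Neither matrix is primitive on its own, but the set is:
# A2 * A1 * A2 has all entries positive.
A1 = [[0, 1], [1, 0]]
A2 = [[1, 1], [0, 1]]
print(shortest_positive_product([A1, A2]))
```

For a single matrix the same function checks ordinary primitivity: it returns a length for `[[0, 1], [1, 1]]` and `None` for the permutation matrix `[[0, 1], [1, 0]]`.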
-
Distributed optimization over time-varying directed graphs
Authors:
Angelia Nedic,
Alex Olshevsky
Abstract:
We consider distributed optimization by a collection of nodes, each having access to its own convex function, whose collective goal is to minimize the sum of the functions. The communications between nodes are described by a time-varying sequence of directed graphs, which is uniformly strongly connected. For such communications, assuming that every node knows its out-degree, we develop a broadcast-based algorithm, termed the subgradient-push, which steers every node to an optimal value under a standard assumption of subgradient boundedness. The subgradient-push requires no knowledge of either the number of agents or the graph sequence to implement. Our analysis shows that the subgradient-push algorithm converges at a rate of $O(\ln(t)/\sqrt{t})$, where the constant depends on the initial values at the nodes, the subgradient norms, and, more interestingly, on both the consensus speed and the imbalances of influence among the nodes.
Submitted 15 March, 2014; v1 submitted 10 March, 2013;
originally announced March 2013.
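Illustrative note (not from the paper): the abstract above mentions a broadcast-based scheme in which every node knows only its out-degree. A standard building block with that property is push-sum averaging on a directed graph. The Python sketch below shows push-sum alone, on a fixed graph, with no subgradient step; it is an assumption about the flavor of the method rather than the paper's algorithm, and the function name and example graph are hypothetical.

```python
def push_sum(values, out_neighbors, iterations=200):
    """Push-sum averaging on a fixed directed graph.

    values:        initial number held by each node (list of length n)
    out_neighbors: out_neighbors[j] lists the nodes that node j sends to
                   (a self-loop is always added)
    Returns each node's estimate x_i / y_i of the average of `values`.
    """
    n = len(values)
    x = [float(v) for v in values]  # running numerators
    y = [1.0] * n                   # running weights
    for _ in range(iterations):
        new_x = [0.0] * n
        new_y = [0.0] * n
        for j in range(n):
            targets = set(out_neighbors[j]) | {j}
            d = len(targets)        # out-degree, counting the self-loop
            for i in targets:
                # Node j splits its mass equally among its out-neighbors.
                new_x[i] += x[j] / d
                new_y[i] += y[j] / d
        x, y = new_x, new_y
    return [xi / yi for xi, yi in zip(x, y)]


# Strongly connected, unbalanced digraph on three nodes: 0 -> 1 -> 2 -> 0, plus 0 -> 2.
graph = {0: [1, 2], 1: [2], 2: [0]}
estimates = push_sum([1.0, 2.0, 6.0], graph)
print(estimates)  # each entry should be close to the average, 3.0
```

Dividing by the weight `y_i` is what corrects for the unequal influence of nodes in a directed graph; the sums of `x` and of `y` are both preserved at every step because each node splits its mass into equal shares.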
-
Graph diameter, eigenvalues, and minimum-time consensus
Authors:
Julien M. Hendrickx,
Raphaël M. Jungers,
Alexander Olshevsky,
Guillaume Vankeerberghen
Abstract:
We consider the problem of achieving average consensus in the minimum number of linear iterations on a fixed, undirected graph. We are motivated by the task of deriving lower bounds for consensus protocols and by the so-called "definitive consensus conjecture" which states that for an undirected connected graph G with diameter D there exist D matrices whose nonzero-pattern complies with the edges in G and whose product equals the all-ones matrix. Our first result is a counterexample to the definitive consensus conjecture, which is the first improvement of the diameter lower bound for linear consensus protocols. We then provide some algebraic conditions under which this conjecture holds, which we use to establish that all distance-regular graphs satisfy the definitive consensus conjecture.
Submitted 29 August, 2013; v1 submitted 27 November, 2012;
originally announced November 2012.
-
Cooperative learning in multi-agent systems from intermittent measurements
Authors:
Naomi Ehrich Leonard,
Alex Olshevsky
Abstract:
Motivated by the problem of tracking a direction in a decentralized way, we consider the general problem of cooperative learning in multi-agent systems with time-varying connectivity and intermittent measurements. We propose a distributed learning protocol capable of learning an unknown vector $μ$ from noisy measurements made independently by autonomous nodes. Our protocol is completely distributed and able to cope with the time-varying, unpredictable, and noisy nature of inter-agent communication, and intermittent noisy measurements of $μ$. Our main result bounds the learning speed of our protocol in terms of the size and combinatorial features of the (time-varying) networks connecting the nodes.
Submitted 15 December, 2014; v1 submitted 10 September, 2012;
originally announced September 2012.
-
NP-hardness of Deciding Convexity of Quartic Polynomials and Related Problems
Authors:
Amir Ali Ahmadi,
Alex Olshevsky,
Pablo A. Parrilo,
John N. Tsitsiklis
Abstract:
We show that unless P=NP, there exists no polynomial time (or even pseudo-polynomial time) algorithm that can decide whether a multivariate polynomial of degree four (or higher even degree) is globally convex. This solves a problem that has been open since 1992 when N. Z. Shor asked for the complexity of deciding convexity for quartic polynomials. We also prove that deciding strict convexity, strong convexity, quasiconvexity, and pseudoconvexity of polynomials of even degree four or higher is strongly NP-hard. By contrast, we show that quasiconvexity and pseudoconvexity of odd degree polynomials can be decided in polynomial time.
Submitted 8 December, 2010;
originally announced December 2010.
-
Distributed anonymous discrete function computation
Authors:
Julien M. Hendrickx,
Alex Olshevsky,
John N. Tsitsiklis
Abstract:
We propose a model for deterministic distributed function computation by a network of identical and anonymous nodes. In this model, each node has bounded computation and storage capabilities that do not grow with the network size. Furthermore, each node only knows its neighbors, not the entire graph. Our goal is to characterize the class of functions that can be computed within this model. In our main result, we provide a necessary condition for computability which we show to be nearly sufficient, in the sense that every function that satisfies this condition can at least be approximated. The problem of computing suitably rounded averages in a distributed manner plays a central role in our development; we provide an algorithm that solves it in time that grows quadratically with the size of the network.
Submitted 25 June, 2011; v1 submitted 12 April, 2010;
originally announced April 2010.
-
Matrix P-norms are NP-hard to approximate if p \neq 1,2,\infty
Authors:
Julien M. Hendrickx,
Alex Olshevsky
Abstract:
We show that for any rational $p \in [1,\infty)$ except $p = 1, 2$, unless P = NP, there is no polynomial-time algorithm for approximating the matrix $p$-norm to arbitrary relative precision. We also show that for any rational $p \in [1,\infty)$ including $p = 1, 2$, unless P = NP, there is no polynomial-time algorithm that approximates the $\infty, p$ mixed norm to some fixed relative precision.
Submitted 23 April, 2010; v1 submitted 10 August, 2009;
originally announced August 2009.
-
Distributed anonymous function computation in information fusion and multiagent systems
Authors:
Julien M. Hendrickx,
Alex Olshevsky,
John N. Tsitsiklis
Abstract:
We propose a model for deterministic distributed function computation by a network of identical and anonymous nodes, with bounded computation and storage capabilities that do not scale with the network size. Our goal is to characterize the class of functions that can be computed within this model. In our main result, we exhibit a class of non-computable functions, and prove that every function outside this class can at least be approximated. The problem of computing averages in a distributed manner plays a central role in our development.
Submitted 28 July, 2009; v1 submitted 16 July, 2009;
originally announced July 2009.