-
Differentiable Cost-Parameterized Monge Map Estimators
Authors:
Samuel Howard,
George Deligiannidis,
Patrick Rebeschini,
James Thornton
Abstract:
Within the field of optimal transport (OT), the choice of ground cost is crucial to ensuring that the optimality of a transport map corresponds to usefulness in real-world applications. It is therefore desirable to use known information to tailor cost functions and hence learn OT maps which are adapted to the problem at hand. By considering a class of neural ground costs whose Monge maps have a kn…
▽ More
Within the field of optimal transport (OT), the choice of ground cost is crucial to ensuring that the optimality of a transport map corresponds to usefulness in real-world applications. It is therefore desirable to use known information to tailor cost functions and hence learn OT maps which are adapted to the problem at hand. By considering a class of neural ground costs whose Monge maps have a known form, we construct a differentiable Monge map estimator which can be optimized to be consistent with known information about an OT map. In doing so, we simultaneously learn both an OT map estimator and a corresponding adapted cost function. Through suitable choices of loss function, our method provides a general approach for incorporating prior information about the Monge map itself when learning adapted OT maps and cost functions.
△ Less
Submitted 12 June, 2024;
originally announced June 2024.
-
Learning mirror maps in policy mirror descent
Authors:
Carlo Alfano,
Sebastian Towers,
Silvia Sapora,
Chris Lu,
Patrick Rebeschini
Abstract:
Policy Mirror Descent (PMD) is a popular framework in reinforcement learning, serving as a unifying perspective that encompasses numerous algorithms. These algorithms are derived through the selection of a mirror map and enjoy finite-time convergence guarantees. Despite its popularity, the exploration of PMD's full potential is limited, with the majority of research focusing on a particular mirror…
▽ More
Policy Mirror Descent (PMD) is a popular framework in reinforcement learning, serving as a unifying perspective that encompasses numerous algorithms. These algorithms are derived through the selection of a mirror map and enjoy finite-time convergence guarantees. Despite its popularity, the exploration of PMD's full potential is limited, with the majority of research focusing on a particular mirror map -- namely, the negative entropy -- which gives rise to the renowned Natural Policy Gradient (NPG) method. It remains uncertain from existing theoretical studies whether the choice of mirror map significantly influences PMD's efficacy. In our work, we conduct empirical investigations to show that the conventional mirror map choice (NPG) often yields less-than-optimal outcomes across several standard benchmark environments. Using evolutionary strategies, we identify more efficient mirror maps that enhance the performance of PMD. We first focus on a tabular environment, i.e. Grid-World, where we relate existing theoretical bounds with the performance of PMD for a few standard mirror maps and the learned one. We then show that it is possible to learn a mirror map that outperforms the negative entropy in more complex environments, such as the MinAtar suite. Our results suggest that mirror maps generalize well across various environments, raising questions about how to best match a mirror map to an environment's structure and characteristics.
△ Less
Submitted 7 June, 2024; v1 submitted 7 February, 2024;
originally announced February 2024.
-
Generalization Bounds for Label Noise Stochastic Gradient Descent
Authors:
Jung Eun Huh,
Patrick Rebeschini
Abstract:
We develop generalization error bounds for stochastic gradient descent (SGD) with label noise in non-convex settings under uniform dissipativity and smoothness conditions. Under a suitable choice of semimetric, we establish a contraction in Wasserstein distance of the label noise stochastic gradient flow that depends polynomially on the parameter dimension $d$. Using the framework of algorithmic s…
▽ More
We develop generalization error bounds for stochastic gradient descent (SGD) with label noise in non-convex settings under uniform dissipativity and smoothness conditions. Under a suitable choice of semimetric, we establish a contraction in Wasserstein distance of the label noise stochastic gradient flow that depends polynomially on the parameter dimension $d$. Using the framework of algorithmic stability, we derive time-independent generalisation error bounds for the discretized algorithm with a constant learning rate. The error bound we achieve scales polynomially with $d$ and with the rate of $n^{-2/3}$, where $n$ is the sample size. This rate is better than the best-known rate of $n^{-1/2}$ established for stochastic gradient Langevin dynamics (SGLD) -- which employs parameter-independent Gaussian noise -- under similar conditions. Our analysis offers quantitative insights into the effect of label noise.
△ Less
Submitted 31 October, 2023;
originally announced November 2023.
-
A Novel Framework for Policy Mirror Descent with General Parameterization and Linear Convergence
Authors:
Carlo Alfano,
Rui Yuan,
Patrick Rebeschini
Abstract:
Modern policy optimization methods in reinforcement learning, such as TRPO and PPO, owe their success to the use of parameterized policies. However, while theoretical guarantees have been established for this class of algorithms, especially in the tabular setting, the use of general parameterization schemes remains mostly unjustified. In this work, we introduce a novel framework for policy optimiz…
▽ More
Modern policy optimization methods in reinforcement learning, such as TRPO and PPO, owe their success to the use of parameterized policies. However, while theoretical guarantees have been established for this class of algorithms, especially in the tabular setting, the use of general parameterization schemes remains mostly unjustified. In this work, we introduce a novel framework for policy optimization based on mirror descent that naturally accommodates general parameterizations. The policy class induced by our scheme recovers known classes, e.g., softmax, and generates new ones depending on the choice of mirror map. Using our framework, we obtain the first result that guarantees linear convergence for a policy-gradient-based method involving general parameterization. To demonstrate the ability of our framework to accommodate general parameterization schemes, we provide its sample complexity when using shallow neural networks, show that it represents an improvement upon the previous best results, and empirically validate the effectiveness of our theoretical claims on classic control tasks.
△ Less
Submitted 13 February, 2024; v1 submitted 30 January, 2023;
originally announced January 2023.
-
Exponential Tail Local Rademacher Complexity Risk Bounds Without the Bernstein Condition
Authors:
Varun Kanade,
Patrick Rebeschini,
Tomas Vaskevicius
Abstract:
The local Rademacher complexity framework is one of the most successful general-purpose toolboxes for establishing sharp excess risk bounds for statistical estimators based on the framework of empirical risk minimization. Applying this toolbox typically requires using the Bernstein condition, which often restricts applicability to convex and proper settings. Recent years have witnessed several exa…
▽ More
The local Rademacher complexity framework is one of the most successful general-purpose toolboxes for establishing sharp excess risk bounds for statistical estimators based on the framework of empirical risk minimization. Applying this toolbox typically requires using the Bernstein condition, which often restricts applicability to convex and proper settings. Recent years have witnessed several examples of problems where optimal statistical performance is only achievable via non-convex and improper estimators originating from aggregation theory, including the fundamental problem of model selection. These examples are currently outside of the reach of the classical localization theory.
In this work, we build upon the recent approach to localization via offset Rademacher complexities, for which a general high-probability theory has yet to be established. Our main result is an exponential-tail excess risk bound expressed in terms of the offset Rademacher complexity that yields results at least as sharp as those obtainable via the classical theory. However, our bound applies under an estimator-dependent geometric condition (the "offset condition") instead of the estimator-independent (but, in general, distribution-dependent) Bernstein condition on which the classical theory relies. Our results apply to improper prediction regimes not directly covered by the classical theory.
△ Less
Submitted 23 February, 2022;
originally announced February 2022.
-
Time-independent Generalization Bounds for SGLD in Non-convex Settings
Authors:
Tyler Farghly,
Patrick Rebeschini
Abstract:
We establish generalization error bounds for stochastic gradient Langevin dynamics (SGLD) with constant learning rate under the assumptions of dissipativity and smoothness, a setting that has received increased attention in the sampling/optimization literature. Unlike existing bounds for SGLD in non-convex settings, ours are time-independent and decay to zero as the sample size increases. Using th…
▽ More
We establish generalization error bounds for stochastic gradient Langevin dynamics (SGLD) with constant learning rate under the assumptions of dissipativity and smoothness, a setting that has received increased attention in the sampling/optimization literature. Unlike existing bounds for SGLD in non-convex settings, ours are time-independent and decay to zero as the sample size increases. Using the framework of uniform stability, we establish time-independent bounds by exploiting the Wasserstein contraction property of the Langevin diffusion, which also allows us to circumvent the need to bound gradients using Lipschitz-like assumptions. Our analysis also supports variants of SGLD that use different discretization methods, incorporate Euclidean projections, or use non-isotropic noise.
△ Less
Submitted 24 November, 2021;
originally announced November 2021.
-
On Optimal Interpolation In Linear Regression
Authors:
Eduard Oravkin,
Patrick Rebeschini
Abstract:
Understanding when and why interpolating methods generalize well has recently been a topic of interest in statistical learning theory. However, systematically connecting interpolating methods to achievable notions of optimality has only received partial attention. In this paper, we investigate the question of what is the optimal way to interpolate in linear regression using functions that are line…
▽ More
Understanding when and why interpolating methods generalize well has recently been a topic of interest in statistical learning theory. However, systematically connecting interpolating methods to achievable notions of optimality has only received partial attention. In this paper, we investigate the question of what is the optimal way to interpolate in linear regression using functions that are linear in the response variable (as the case for the Bayes optimal estimator in ridge regression) and depend on the data, the population covariance of the data, the signal-to-noise ratio and the covariance of the prior for the signal, but do not depend on the value of the signal itself nor the noise vector in the training data. We provide a closed-form expression for the interpolator that achieves this notion of optimality and show that it can be derived as the limit of preconditioned gradient descent with a specific initialization. We identify a regime where the minimum-norm interpolator provably generalizes arbitrarily worse than the optimal response-linear achievable interpolator that we introduce, and validate with numerical experiments that the notion of optimality we consider can be achieved by interpolating methods that only use the training data as input in the case of an isotropic prior. Finally, we extend the notion of optimal response-linear interpolation to random features regression under a linear data-generating model that has been previously studied in the literature.
△ Less
Submitted 21 October, 2021;
originally announced October 2021.
-
Dimension-Free Rates for Natural Policy Gradient in Multi-Agent Reinforcement Learning
Authors:
Carlo Alfano,
Patrick Rebeschini
Abstract:
Cooperative multi-agent reinforcement learning is a decentralized paradigm in sequential decision making where agents distributed over a network iteratively collaborate with neighbors to maximize global (network-wide) notions of rewards. Exact computations typically involve a complexity that scales exponentially with the number of agents. To address this curse of dimensionality, we design a scalab…
▽ More
Cooperative multi-agent reinforcement learning is a decentralized paradigm in sequential decision making where agents distributed over a network iteratively collaborate with neighbors to maximize global (network-wide) notions of rewards. Exact computations typically involve a complexity that scales exponentially with the number of agents. To address this curse of dimensionality, we design a scalable algorithm based on the Natural Policy Gradient framework that uses local information and only requires agents to communicate with neighbors within a certain range. Under standard assumptions on the spatial decay of correlations for the transition dynamics of the underlying Markov process and the localized learning policy, we show that our algorithm converges to the globally optimal policy with a dimension-free statistical and computational complexity, incurring a localization error that does not depend on the number of agents and converges to zero exponentially fast as a function of the range of communication.
△ Less
Submitted 23 September, 2021;
originally announced September 2021.
-
Comparing Classes of Estimators: When does Gradient Descent Beat Ridge Regression in Linear Models?
Authors:
Dominic Richards,
Edgar Dobriban,
Patrick Rebeschini
Abstract:
Methods for learning from data depend on various types of tuning parameters, such as penalization strength or step size. Since performance can depend strongly on these parameters, it is important to compare classes of estimators-by considering prescribed finite sets of tuning parameters-not just particularly tuned methods. In this work, we investigate classes of methods via the relative performanc…
▽ More
Methods for learning from data depend on various types of tuning parameters, such as penalization strength or step size. Since performance can depend strongly on these parameters, it is important to compare classes of estimators-by considering prescribed finite sets of tuning parameters-not just particularly tuned methods. In this work, we investigate classes of methods via the relative performance of the best method in the class. We consider the central problem of linear regression-with a random isotropic ground truth-and investigate the estimation performance of two fundamental methods, gradient descent and ridge regression. We unveil the following phenomena. (1) For general designs, constant stepsize gradient descent outperforms ridge regression when the eigenvalues of the empirical data covariance matrix decay slowly, as a power law with exponent less than unity. If instead the eigenvalues decay quickly, as a power law with exponent greater than unity or exponentially, we show that ridge regression outperforms gradient descent. (2) For orthogonal designs, we compute the exact minimax optimal class of estimators (achieving min-max-min optimality), showing it is equivalent to gradient descent with decaying learning rate. We find the sub-optimality of ridge regression and gradient descent with constant step size. Our results highlight that statistical performance can depend strongly on tuning parameters. In particular, while optimally tuned ridge regression is the best estimator in our setting, it can be outperformed by gradient descent by an arbitrary/unbounded amount when both methods are only tuned over finitely many regularization parameters.
△ Less
Submitted 12 June, 2022; v1 submitted 26 August, 2021;
originally announced August 2021.
-
Implicit Regularization in Matrix Sensing via Mirror Descent
Authors:
Fan Wu,
Patrick Rebeschini
Abstract:
We study discrete-time mirror descent applied to the unregularized empirical risk in matrix sensing. In both the general case of rectangular matrices and the particular case of positive semidefinite matrices, a simple potential-based analysis in terms of the Bregman divergence allows us to establish convergence of mirror descent -- with different choices of the mirror maps -- to a matrix that, amo…
▽ More
We study discrete-time mirror descent applied to the unregularized empirical risk in matrix sensing. In both the general case of rectangular matrices and the particular case of positive semidefinite matrices, a simple potential-based analysis in terms of the Bregman divergence allows us to establish convergence of mirror descent -- with different choices of the mirror maps -- to a matrix that, among all global minimizers of the empirical risk, minimizes a quantity explicitly related to the nuclear norm, the Frobenius norm, and the von Neumann entropy. In both cases, this characterization implies that mirror descent, a first-order algorithm minimizing the unregularized empirical risk, recovers low-rank matrices under the same set of assumptions that are sufficient to guarantee recovery for nuclear-norm minimization. When the sensing matrices are symmetric and commute, we show that gradient descent with full-rank factorized parametrization is a first-order approximation to mirror descent, in which case we obtain an explicit characterization of the implicit bias of gradient flow as a by-product.
△ Less
Submitted 27 October, 2021; v1 submitted 28 May, 2021;
originally announced May 2021.
-
Nearly Minimax-Optimal Rates for Noisy Sparse Phase Retrieval via Early-Stopped Mirror Descent
Authors:
Fan Wu,
Patrick Rebeschini
Abstract:
This paper studies early-stopped mirror descent applied to noisy sparse phase retrieval, which is the problem of recovering a $k$-sparse signal $\mathbf{x}^\star\in\mathbb{R}^n$ from a set of quadratic Gaussian measurements corrupted by sub-exponential noise. We consider the (non-convex) unregularized empirical risk minimization problem and show that early-stopped mirror descent, when equipped wit…
▽ More
This paper studies early-stopped mirror descent applied to noisy sparse phase retrieval, which is the problem of recovering a $k$-sparse signal $\mathbf{x}^\star\in\mathbb{R}^n$ from a set of quadratic Gaussian measurements corrupted by sub-exponential noise. We consider the (non-convex) unregularized empirical risk minimization problem and show that early-stopped mirror descent, when equipped with the hyperbolic entropy mirror map and proper initialization, achieves a nearly minimax-optimal rate of convergence, provided the sample size is at least of order $k^2$ (modulo logarithmic term) and the minimum (in modulus) non-zero entry of the signal is on the order of $\|\mathbf{x}^\star\|_2/\sqrt{k}$. Our theory leads to a simple algorithm that does not rely on explicit regularization or thresholding steps to promote sparsity. More generally, our results establish a connection between mirror descent and sparsity in the non-convex problem of noisy sparse phase retrieval, adding to the literature on early stop** that has mostly focused on non-sparse, Euclidean, and convex settings via gradient descent. Our proof combines a potential-based analysis of mirror descent with a quantitative control on a variational coherence property that we establish along the path of mirror descent, up to a prescribed stop** time.
△ Less
Submitted 8 May, 2021;
originally announced May 2021.
-
A Continuous-Time Mirror Descent Approach to Sparse Phase Retrieval
Authors:
Fan Wu,
Patrick Rebeschini
Abstract:
We analyze continuous-time mirror descent applied to sparse phase retrieval, which is the problem of recovering sparse signals from a set of magnitude-only measurements. We apply mirror descent to the unconstrained empirical risk minimization problem (batch setting), using the square loss and square measurements. We provide a convergence analysis of the algorithm in this non-convex setting and pro…
▽ More
We analyze continuous-time mirror descent applied to sparse phase retrieval, which is the problem of recovering sparse signals from a set of magnitude-only measurements. We apply mirror descent to the unconstrained empirical risk minimization problem (batch setting), using the square loss and square measurements. We provide a convergence analysis of the algorithm in this non-convex setting and prove that, with the hypentropy mirror map, mirror descent recovers any $k$-sparse vector $\mathbf{x}^\star\in\mathbb{R}^n$ with minimum (in modulus) non-zero entry on the order of $\| \mathbf{x}^\star \|_2/\sqrt{k}$ from $k^2$ Gaussian measurements, modulo logarithmic terms. This yields a simple algorithm which, unlike most existing approaches to sparse phase retrieval, adapts to the sparsity level, without including thresholding steps or adding regularization terms. Our results also provide a principled theoretical understanding for Hadamard Wirtinger flow [58], as Euclidean gradient descent applied to the empirical risk problem with Hadamard parametrization can be recovered as a first-order approximation to mirror descent in discrete time.
△ Less
Submitted 20 October, 2020;
originally announced October 2020.
-
Decentralised Learning with Random Features and Distributed Gradient Descent
Authors:
Dominic Richards,
Patrick Rebeschini,
Lorenzo Rosasco
Abstract:
We investigate the generalisation performance of Distributed Gradient Descent with Implicit Regularisation and Random Features in the homogenous setting where a network of agents are given data sampled independently from the same unknown distribution. Along with reducing the memory footprint, Random Features are particularly convenient in this setting as they provide a common parameterisation acro…
▽ More
We investigate the generalisation performance of Distributed Gradient Descent with Implicit Regularisation and Random Features in the homogenous setting where a network of agents are given data sampled independently from the same unknown distribution. Along with reducing the memory footprint, Random Features are particularly convenient in this setting as they provide a common parameterisation across agents that allows to overcome previous difficulties in implementing Decentralised Kernel Regression. Under standard source and capacity assumptions, we establish high probability bounds on the predictive performance for each agent as a function of the step size, number of iterations, inverse spectral gap of the communication matrix and number of Random Features. By tuning these parameters, we obtain statistical rates that are minimax optimal with respect to the total number of samples in the network. The algorithm provides a linear improvement over single machine Gradient Descent in memory cost and, when agents hold enough data with respect to the network size and inverse spectral gap, a linear speed-up in computational runtime for any network topology. We present simulations that show how the number of Random Features, iterations and samples impact predictive performance.
△ Less
Submitted 1 July, 2020;
originally announced July 2020.
-
Hadamard Wirtinger Flow for Sparse Phase Retrieval
Authors:
Fan Wu,
Patrick Rebeschini
Abstract:
We consider the problem of reconstructing an $n$-dimensional $k$-sparse signal from a set of noiseless magnitude-only measurements. Formulating the problem as an unregularized empirical risk minimization task, we study the sample complexity performance of gradient descent with Hadamard parametrization, which we call Hadamard Wirtinger flow (HWF). Provided knowledge of the signal sparsity $k$, we p…
▽ More
We consider the problem of reconstructing an $n$-dimensional $k$-sparse signal from a set of noiseless magnitude-only measurements. Formulating the problem as an unregularized empirical risk minimization task, we study the sample complexity performance of gradient descent with Hadamard parametrization, which we call Hadamard Wirtinger flow (HWF). Provided knowledge of the signal sparsity $k$, we prove that a single step of HWF is able to recover the support from $k(x^*_{max})^{-2}$ (modulo logarithmic term) samples, where $x^*_{max}$ is the largest component of the signal in magnitude. This support recovery procedure can be used to initialize existing reconstruction methods and yields algorithms with total runtime proportional to the cost of reading the data and improved sample complexity, which is linear in $k$ when the signal contains at least one large component. We numerically investigate the performance of HWF at convergence and show that, while not requiring any explicit form of regularization nor knowledge of $k$, HWF adapts to the signal sparsity and reconstructs sparse signals with fewer measurements than existing gradient based methods.
△ Less
Submitted 24 February, 2021; v1 submitted 1 June, 2020;
originally announced June 2020.
-
The Statistical Complexity of Early-Stopped Mirror Descent
Authors:
Tomas Vaškevičius,
Varun Kanade,
Patrick Rebeschini
Abstract:
Recently there has been a surge of interest in understanding implicit regularization properties of iterative gradient-based optimization algorithms. In this paper, we study the statistical guarantees on the excess risk achieved by early-stopped unconstrained mirror descent algorithms applied to the unregularized empirical risk with the squared loss for linear models and kernel methods. By completi…
▽ More
Recently there has been a surge of interest in understanding implicit regularization properties of iterative gradient-based optimization algorithms. In this paper, we study the statistical guarantees on the excess risk achieved by early-stopped unconstrained mirror descent algorithms applied to the unregularized empirical risk with the squared loss for linear models and kernel methods. By completing an inequality that characterizes convexity for the squared loss, we identify an intrinsic link between offset Rademacher complexities and potential-based convergence analysis of mirror descent methods. Our observation immediately yields excess risk guarantees for the path traced by the iterates of mirror descent in terms of offset complexities of certain function classes depending only on the choice of the mirror map, initialization point, step-size, and the number of iterations. We apply our theory to recover, in a clean and elegant manner via rather short proofs, some of the recent results in the implicit regularization literature, while also showing how to improve upon them in some settings.
△ Less
Submitted 27 August, 2020; v1 submitted 1 February, 2020;
originally announced February 2020.
-
Distributed Machine Learning with Sparse Heterogeneous Data
Authors:
Dominic Richards,
Sahand N. Negahban,
Patrick Rebeschini
Abstract:
Motivated by distributed machine learning settings such as Federated Learning, we consider the problem of fitting a statistical model across a distributed collection of heterogeneous data sets whose similarity structure is encoded by a graph topology. Precisely, we analyse the case where each node is associated with fitting a sparse linear model, and edges join two nodes if the difference of their…
▽ More
Motivated by distributed machine learning settings such as Federated Learning, we consider the problem of fitting a statistical model across a distributed collection of heterogeneous data sets whose similarity structure is encoded by a graph topology. Precisely, we analyse the case where each node is associated with fitting a sparse linear model, and edges join two nodes if the difference of their solutions is also sparse. We propose a method based on Basis Pursuit Denoising with a total variation penalty, and provide finite sample guarantees for sub-Gaussian design matrices. Taking the root of the tree as a reference node, we show that if the sparsity of the differences across nodes is smaller than the sparsity at the root, then recovery is successful with fewer samples than by solving the problems independently, or by using methods that rely on a large overlap in the signal supports, such as the group Lasso. We consider both the noiseless and noisy setting, and numerically investigate the performance of distributed methods based on Distributed Alternating Direction Methods of Multipliers (ADMM) and hyperspectral unmixing.
△ Less
Submitted 27 November, 2021; v1 submitted 3 December, 2019;
originally announced December 2019.
-
Implicit Regularization for Optimal Sparse Recovery
Authors:
Tomas Vaškevičius,
Varun Kanade,
Patrick Rebeschini
Abstract:
We investigate implicit regularization schemes for gradient descent methods applied to unpenalized least squares regression to solve the problem of reconstructing a sparse signal from an underdetermined system of linear measurements under the restricted isometry assumption. For a given parametrization yielding a non-convex optimization problem, we show that prescribed choices of initialization, st…
▽ More
We investigate implicit regularization schemes for gradient descent methods applied to unpenalized least squares regression to solve the problem of reconstructing a sparse signal from an underdetermined system of linear measurements under the restricted isometry assumption. For a given parametrization yielding a non-convex optimization problem, we show that prescribed choices of initialization, step size and stop** time yield a statistically and computationally optimal algorithm that achieves the minimax rate with the same cost required to read the data up to poly-logarithmic factors. Beyond minimax optimality, we show that our algorithm adapts to instance difficulty and yields a dimension-independent rate when the signal-to-noise ratio is high enough. Key to the computational efficiency of our method is an increasing step size scheme that adapts to refined estimates of the true solution. We validate our findings with numerical experiments and compare our algorithm against explicit $\ell_{1}$ penalization. Going from hard instances to easy ones, our algorithm is seen to undergo a phase transition, eventually matching least squares with an oracle knowledge of the true support.
△ Less
Submitted 11 September, 2019;
originally announced September 2019.
-
Optimal Statistical Rates for Decentralised Non-Parametric Regression with Linear Speed-Up
Authors:
Dominic Richards,
Patrick Rebeschini
Abstract:
We analyse the learning performance of Distributed Gradient Descent in the context of multi-agent decentralised non-parametric regression with the square loss function when i.i.d. samples are assigned to agents. We show that if agents hold sufficiently many samples with respect to the network size, then Distributed Gradient Descent achieves optimal statistical rates with a number of iterations tha…
▽ More
We analyse the learning performance of Distributed Gradient Descent in the context of multi-agent decentralised non-parametric regression with the square loss function when i.i.d. samples are assigned to agents. We show that if agents hold sufficiently many samples with respect to the network size, then Distributed Gradient Descent achieves optimal statistical rates with a number of iterations that scales, up to a threshold, with the inverse of the spectral gap of the gossip matrix divided by the number of samples owned by each agent raised to a problem-dependent power. The presence of the threshold comes from statistics. It encodes the existence of a "big data" regime where the number of required iterations does not depend on the network topology. In this regime, Distributed Gradient Descent achieves optimal statistical rates with the same order of iterations as gradient descent run with all the samples in the network. Provided the communication delay is sufficiently small, the distributed protocol yields a linear speed-up in runtime compared to the single-machine protocol. This is in contrast to decentralised optimisation algorithms that do not exploit statistics and only yield a linear speed-up in graphs where the spectral gap is bounded away from zero. Our results exploit the statistical concentration of quantities held by agents and shed new light on the interplay between statistics and communication in decentralised methods. Bounds are given in the standard non-parametric setting with source/capacity assumptions.
△ Less
Submitted 13 November, 2019; v1 submitted 8 May, 2019;
originally announced May 2019.
-
Decentralized Cooperative Stochastic Bandits
Authors:
David Martínez-Rubio,
Varun Kanade,
Patrick Rebeschini
Abstract:
We study a decentralized cooperative stochastic multi-armed bandit problem with $K$ arms on a network of $N$ agents. In our model, the reward distribution of each arm is the same for each agent and rewards are drawn independently across agents and time steps. In each round, each agent chooses an arm to play and subsequently sends a message to her neighbors. The goal is to minimize the overall regr…
▽ More
We study a decentralized cooperative stochastic multi-armed bandit problem with $K$ arms on a network of $N$ agents. In our model, the reward distribution of each arm is the same for each agent and rewards are drawn independently across agents and time steps. In each round, each agent chooses an arm to play and subsequently sends a message to her neighbors. The goal is to minimize the overall regret of the entire network. We design a fully decentralized algorithm that uses an accelerated consensus procedure to compute (delayed) estimates of the average of rewards obtained by all the agents for each arm, and then uses an upper confidence bound (UCB) algorithm that accounts for the delay and error of the estimates. We analyze the regret of our algorithm and also provide a lower bound. The regret is bounded by the optimal centralized regret plus a natural and simple term depending on the spectral gap of the communication matrix. Our algorithm is simpler to analyze than those proposed in prior work and it achieves better regret bounds, while requiring less information about the underlying network. It also performs better empirically.
△ Less
Submitted 24 October, 2019; v1 submitted 10 October, 2018;
originally announced October 2018.
-
Graph-Dependent Implicit Regularisation for Distributed Stochastic Subgradient Descent
Authors:
Dominic Richards,
Patrick Rebeschini
Abstract:
We propose graph-dependent implicit regularisation strategies for distributed stochastic subgradient descent (Distributed SGD) for convex problems in multi-agent learning. Under the standard assumptions of convexity, Lipschitz continuity, and smoothness, we establish statistical learning rates that retain, up to logarithmic terms, centralised statistical guarantees through implicit regularisation…
▽ More
We propose graph-dependent implicit regularisation strategies for distributed stochastic subgradient descent (Distributed SGD) for convex problems in multi-agent learning. Under the standard assumptions of convexity, Lipschitz continuity, and smoothness, we establish statistical learning rates that retain, up to logarithmic terms, centralised statistical guarantees through implicit regularisation (step size tuning and early stop**) with appropriate dependence on the graph topology. Our approach avoids the need for explicit regularisation in decentralised learning problems, such as adding constraints to the empirical risk minimisation rule. Particularly for distributed methods, the use of implicit regularisation allows the algorithm to remain simple, without projections or dual methods. To prove our results, we establish graph-independent generalisation bounds for Distributed SGD that match the centralised setting (using algorithmic stability), and we establish graph-dependent optimisation bounds that are of independent interest. We present numerical experiments to show that the qualitative nature of the upper bounds we derive can be representative of real behaviours.
△ Less
Submitted 18 September, 2018;
originally announced September 2018.
-
A New Approach to Laplacian Solvers and Flow Problems
Authors:
Patrick Rebeschini,
Sekhar Tatikonda
Abstract:
This paper investigates the behavior of the Min-Sum message passing scheme to solve systems of linear equations in the Laplacian matrices of graphs and to compute electric flows. Voltage and flow problems involve the minimization of quadratic functions and are fundamental primitives that arise in several domains. Algorithms that have been proposed are typically centralized and involve multiple gra…
▽ More
This paper investigates the behavior of the Min-Sum message passing scheme to solve systems of linear equations in the Laplacian matrices of graphs and to compute electric flows. Voltage and flow problems involve the minimization of quadratic functions and are fundamental primitives that arise in several domains. Algorithms that have been proposed are typically centralized and involve multiple graph-theoretic constructions or sampling mechanisms that make them difficult to implement and analyze. On the other hand, message passing routines are distributed, simple, and easy to implement. In this paper we establish a framework to analyze Min-Sum to solve voltage and flow problems. We characterize the error committed by the algorithm on general weighted graphs in terms of hitting times of random walks defined on the computation trees that support the operations of the algorithms with time. For $d$-regular graphs with equal weights, we show that the convergence of the algorithms is controlled by the total variation distance between the distributions of non-backtracking random walks defined on the original graph that start from neighboring nodes. The framework that we introduce extends the analysis of Min-Sum to settings where the contraction arguments previously considered in the literature (based on the assumption of walk summability or scaled diagonal dominance) can not be used, possibly in the presence of constraints.
△ Less
Submitted 7 March, 2019; v1 submitted 21 November, 2016;
originally announced November 2016.
-
Scale-free network optimization: foundations and algorithms
Authors:
Patrick Rebeschini,
Sekhar Tatikonda
Abstract:
We investigate the fundamental principles that drive the development of scalable algorithms for network optimization. Despite the significant amount of work on parallel and decentralized algorithms in the optimization community, the methods that have been proposed typically rely on strict separability assumptions for objective function and constraints. Beside sparsity, these methods typically do n…
▽ More
We investigate the fundamental principles that drive the development of scalable algorithms for network optimization. Despite the significant amount of work on parallel and decentralized algorithms in the optimization community, the methods that have been proposed typically rely on strict separability assumptions for objective function and constraints. Beside sparsity, these methods typically do not exploit the strength of the interaction between variables in the system. We propose a notion of correlation in constrained optimization that is based on the sensitivity of the optimal solution upon perturbations of the constraints. We develop a general theory of sensitivity of optimizers the extends beyond the infinitesimal setting. We present instances in network optimization where the correlation decays exponentially fast with respect to the natural distance in the network, and we design algorithms that can exploit this decay to yield dimension-free optimization. Our results are the first of their kind, and open new possibilities in the theory of local algorithms.
△ Less
Submitted 12 February, 2016;
originally announced February 2016.
-
Fast Mixing for Discrete Point Processes
Authors:
Patrick Rebeschini,
Amin Karbasi
Abstract:
We investigate the systematic mechanism for designing fast mixing Markov chain Monte Carlo algorithms to sample from discrete point processes under the Dobrushin uniqueness condition for Gibbs measures. Discrete point processes are defined as probability distributions $μ(S)\propto \exp(βf(S))$ over all subsets $S\in 2^V$ of a finite set $V$ through a bounded set function…
▽ More
We investigate the systematic mechanism for designing fast mixing Markov chain Monte Carlo algorithms to sample from discrete point processes under the Dobrushin uniqueness condition for Gibbs measures. Discrete point processes are defined as probability distributions $μ(S)\propto \exp(βf(S))$ over all subsets $S\in 2^V$ of a finite set $V$ through a bounded set function $f:2^V\rightarrow \mathbb{R}$ and a parameter $β>0$. A subclass of discrete point processes characterized by submodular functions (which include log-submodular distributions, submodular point processes, and determinantal point processes) has recently gained a lot of interest in machine learning and shown to be effective for modeling diversity and coverage. We show that if the set function (not necessarily submodular) displays a natural notion of decay of correlation, then, for $β$ small enough, it is possible to design fast mixing Markov chain Monte Carlo methods that yield error bounds on marginal approximations that do not depend on the size of the set $V$. The sufficient conditions that we derive involve a control on the (discrete) Hessian of set functions, a quantity that has not been previously considered in the literature. We specialize our results for submodular functions, and we discuss canonical examples where the Hessian can be easily controlled.
△ Less
Submitted 6 June, 2015;
originally announced June 2015.