-
Nearly Optimal Latent State Decoding in Block MDPs
Authors:
Yassir Jedra,
Junghyun Lee,
Alexandre Proutière,
Se-Young Yun
Abstract:
We investigate the problems of model estimation and reward-free learning in episodic Block MDPs. In these MDPs, the decision maker has access to rich observations or contexts generated from a small number of latent states. We are first interested in estimating the latent state decoding function (the map** from the observations to latent states) based on data generated under a fixed behavior poli…
▽ More
We investigate the problems of model estimation and reward-free learning in episodic Block MDPs. In these MDPs, the decision maker has access to rich observations or contexts generated from a small number of latent states. We are first interested in estimating the latent state decoding function (the map** from the observations to latent states) based on data generated under a fixed behavior policy. We derive an information-theoretical lower bound on the error rate for estimating this function and present an algorithm approaching this fundamental limit. In turn, our algorithm also provides estimates of all the components of the MDP. We then study the problem of learning near-optimal policies in the reward-free framework. Based on our efficient model estimation algorithm, we show that we can infer a policy converging (as the number of collected samples grows large) to the optimal policy at the best possible rate. Interestingly, our analysis provides necessary and sufficient conditions under which exploiting the block structure yields improvements in the sample complexity for identifying near-optimal policies. When these conditions are met, the sample complexity in the minimax reward-free setting is improved by a multiplicative factor $n$, where $n$ is the number of possible contexts.
△ Less
Submitted 24 February, 2023; v1 submitted 17 August, 2022;
originally announced August 2022.
-
Minimal Expected Regret in Linear Quadratic Control
Authors:
Yassir Jedra,
Alexandre Proutiere
Abstract:
We consider the problem of online learning in Linear Quadratic Control systems whose state transition and state-action transition matrices $A$ and $B$ may be initially unknown. We devise an online learning algorithm and provide guarantees on its expected regret. This regret at time $T$ is upper bounded (i) by $\widetilde{O}((d_u+d_x)\sqrt{d_xT})$ when $A$ and $B$ are unknown, (ii) by…
▽ More
We consider the problem of online learning in Linear Quadratic Control systems whose state transition and state-action transition matrices $A$ and $B$ may be initially unknown. We devise an online learning algorithm and provide guarantees on its expected regret. This regret at time $T$ is upper bounded (i) by $\widetilde{O}((d_u+d_x)\sqrt{d_xT})$ when $A$ and $B$ are unknown, (ii) by $\widetilde{O}(d_x^2\log(T))$ if only $A$ is unknown, and (iii) by $\widetilde{O}(d_x(d_u+d_x)\log(T))$ if only $B$ is unknown and under some mild non-degeneracy condition ($d_x$ and $d_u$ denote the dimensions of the state and of the control input, respectively). These regret scalings are minimal in $T$, $d_x$ and $d_u$ as they match existing lower bounds in scenario (i) when $d_x\le d_u$ [SF20], and in scenario (ii) [lai1986]. We conjecture that our upper bounds are also optimal in scenario (iii) (there is no known lower bound in this setting).
Existing online algorithms proceed in epochs of (typically exponentially) growing durations. The control policy is fixed within each epoch, which considerably simplifies the analysis of the estimation error on $A$ and $B$ and hence of the regret. Our algorithm departs from this design choice: it is a simple variant of certainty-equivalence regulators, where the estimates of $A$ and $B$ and the resulting control policy can be updated as frequently as we wish, possibly at every step. Quantifying the impact of such a constantly-varying control policy on the performance of these estimates and on the regret constitutes one of the technical challenges tackled in this paper.
△ Less
Submitted 29 September, 2021;
originally announced September 2021.
-
Finite-time Identification of Stable Linear Systems: Optimality of the Least-Squares Estimator
Authors:
Yassir Jedra,
Alexandre Proutiere
Abstract:
We present a new finite-time analysis of the estimation error of the Ordinary Least Squares (OLS) estimator for stable linear time-invariant systems. We characterize the number of observed samples (the length of the observed trajectory) sufficient for the OLS estimator to be $(\varepsilon,δ)$-PAC, i.e., to yield an estimation error less than $\varepsilon$ with probability at least $1-δ$. We show t…
▽ More
We present a new finite-time analysis of the estimation error of the Ordinary Least Squares (OLS) estimator for stable linear time-invariant systems. We characterize the number of observed samples (the length of the observed trajectory) sufficient for the OLS estimator to be $(\varepsilon,δ)$-PAC, i.e., to yield an estimation error less than $\varepsilon$ with probability at least $1-δ$. We show that this number matches existing sample complexity lower bounds [1,2] up to universal multiplicative factors (independent of ($\varepsilon,δ)$ and of the system). This paper hence establishes the optimality of the OLS estimator for stable systems, a result conjectured in [1]. Our analysis of the performance of the OLS estimator is simpler, sharper, and easier to interpret than existing analyses. It relies on new concentration results for the covariates matrix.
△ Less
Submitted 26 March, 2020; v1 submitted 17 March, 2020;
originally announced March 2020.
-
Distributed Algorithms that Solve Boolean Equations with Local and Differential Privacies
Authors:
Hongsheng Qi,
Bo Li,
Rui-Juan **g,
Lei Wang,
Alexandre Proutiere,
Guodong Shi
Abstract:
In this paper, we propose distributed algorithms that solve a system of Boolean equations over a network, where each node in the network possesses only one Boolean equation from the system. The Boolean equation assigned at any particular node is a {\em private} equation known to this node only, and the nodes aim to compute the exact set of solutions to the system without exchanging their local equ…
▽ More
In this paper, we propose distributed algorithms that solve a system of Boolean equations over a network, where each node in the network possesses only one Boolean equation from the system. The Boolean equation assigned at any particular node is a {\em private} equation known to this node only, and the nodes aim to compute the exact set of solutions to the system without exchanging their local equations. We show that each private Boolean equation can be locally lifted to a linear algebraic equation under a basis of Boolean vectors, leading to a network linear equation that is distributedly solvable using existing distributed linear equation algorithms as a subroutine. A number of exact or approximate solutions to the induced linear equation are then computed at each node from different initial values. The solutions to the original Boolean equations are eventually computed locally via a Boolean vector search algorithm. We prove that given solvable Boolean equations, when the initial values of the nodes for the distributed linear equation solving step are i.i.d selected according to a uniform distribution in a high-dimensional cube, our algorithms return the exact solution set of the Boolean equations at each node with high probability. Furthermore, we present an algorithm for distributed verification of the satisfiability of Boolean equations, and prove its correctness. Finally, we show that by utilizing linear equation solvers with differential privacy to replace the in-network computing routines, the overall distributed Boolean equation algorithms can be made differentially private. Under the standard Laplace mechanism, we prove an explicit level of noises that can be injected in the linear equation steps for ensuring a prescribed level of differential privacy.
△ Less
Submitted 3 March, 2021; v1 submitted 19 February, 2020;
originally announced February 2020.
-
Distributed Online Optimization with Long-Term Constraints
Authors:
Deming Yuan,
Alexandre Proutiere,
Guodong Shi
Abstract:
We consider distributed online convex optimization problems, where the distributed system consists of various computing units connected through a time-varying communication graph. In each time step, each computing unit selects a constrained vector, experiences a loss equal to an arbitrary convex function evaluated at this vector, and may communicate to its neighbors in the graph. The objective is…
▽ More
We consider distributed online convex optimization problems, where the distributed system consists of various computing units connected through a time-varying communication graph. In each time step, each computing unit selects a constrained vector, experiences a loss equal to an arbitrary convex function evaluated at this vector, and may communicate to its neighbors in the graph. The objective is to minimize the system-wide loss accumulated over time. We propose a decentralized algorithm with regret and cumulative constraint violation in $\mathcal{O}(T^{\max\{c,1-c\} })$ and $\mathcal{O}(T^{1-c/2})$, respectively, for any $c\in (0,1)$, where $T$ is the time horizon. When the loss functions are strongly convex, we establish improved regret and constraint violation upper bounds in $\mathcal{O}(\log(T))$ and $\mathcal{O}(\sqrt{T\log(T)})$. These regret scalings match those obtained by state-of-the-art algorithms and fundamental limits in the corresponding centralized online optimization problem (for both convex and strongly convex loss functions). In the case of bandit feedback, the proposed algorithms achieve a regret and constraint violation in $\mathcal{O}(T^{\max\{c,1-c/3 \} })$ and $\mathcal{O}(T^{1-c/2})$ for any $c\in (0,1)$. We numerically illustrate the performance of our algorithms for the particular case of distributed online regularized linear regression problems.
△ Less
Submitted 20 December, 2019;
originally announced December 2019.
-
From self-tuning regulators to reinforcement learning and back again
Authors:
Nikolai Matni,
Alexandre Proutiere,
Anders Rantzer,
Stephen Tu
Abstract:
Machine and reinforcement learning (RL) are increasingly being applied to plan and control the behavior of autonomous systems interacting with the physical world. Examples include self-driving vehicles, distributed sensor networks, and agile robots. However, when machine learning is to be applied in these new settings, the algorithms had better come with the same type of reliability, robustness, a…
▽ More
Machine and reinforcement learning (RL) are increasingly being applied to plan and control the behavior of autonomous systems interacting with the physical world. Examples include self-driving vehicles, distributed sensor networks, and agile robots. However, when machine learning is to be applied in these new settings, the algorithms had better come with the same type of reliability, robustness, and safety bounds that are hallmarks of control theory, or failures could be catastrophic. Thus, as learning algorithms are increasingly and more aggressively deployed in safety critical settings, it is imperative that control theorists join the conversation. The goal of this tutorial paper is to provide a starting point for control theorists wishing to work on learning related problems, by covering recent advances bridging learning and control theory, and by placing these results within an appropriate historical context of system identification and adaptive control.
△ Less
Submitted 22 September, 2019; v1 submitted 26 June, 2019;
originally announced June 2019.
-
Distributed Online Linear Regression
Authors:
Deming Yuan,
Alexandre Proutiere,
Guodong Shi
Abstract:
We study online linear regression problems in a distributed setting, where the data is spread over a network. In each round, each network node proposes a linear predictor, with the objective of fitting the \emph{network-wide} data. It then updates its predictor for the next round according to the received local feedback and information received from neighboring nodes. The predictions made at a giv…
▽ More
We study online linear regression problems in a distributed setting, where the data is spread over a network. In each round, each network node proposes a linear predictor, with the objective of fitting the \emph{network-wide} data. It then updates its predictor for the next round according to the received local feedback and information received from neighboring nodes. The predictions made at a given node are assessed through the notion of regret, defined as the difference between their cumulative network-wide square errors and those of the best off-line network-wide linear predictor. Various scenarios are investigated, depending on the nature of the local feedback (full information or bandit feedback), on the set of available predictors (the decision set), and the way data is generated (by an oblivious or adaptive adversary). We propose simple and natural distributed regression algorithms, involving, at each node and in each round, a local gradient descent step and a communication and averaging step where nodes aim at aligning their predictors to those of their neighbors. We establish regret upper bounds typically in ${\cal O}(T^{3/4})$ when the decision set is unbounded and in ${\cal O}(\sqrt{T})$ in case of bounded decision set.
△ Less
Submitted 13 February, 2019;
originally announced February 2019.
-
Clustering in Block Markov Chains
Authors:
Jaron Sanders,
Alexandre Proutière,
Se-Young Yun
Abstract:
This paper considers cluster detection in Block Markov Chains (BMCs). These Markov chains are characterized by a block structure in their transition matrix. More precisely, the $n$ possible states are divided into a finite number of $K$ groups or clusters, such that states in the same cluster exhibit the same transition rates to other states. One observes a trajectory of the Markov chain, and the…
▽ More
This paper considers cluster detection in Block Markov Chains (BMCs). These Markov chains are characterized by a block structure in their transition matrix. More precisely, the $n$ possible states are divided into a finite number of $K$ groups or clusters, such that states in the same cluster exhibit the same transition rates to other states. One observes a trajectory of the Markov chain, and the objective is to recover, from this observation only, the (initially unknown) clusters. In this paper we devise a clustering procedure that accurately, efficiently, and provably detects the clusters. We first derive a fundamental information-theoretical lower bound on the detection error rate satisfied under any clustering algorithm. This bound identifies the parameters of the BMC, and trajectory lengths, for which it is possible to accurately detect the clusters. We next develop two clustering algorithms that can together accurately recover the cluster structure from the shortest possible trajectories, whenever the parameters allow detection. These algorithms thus reach the fundamental detectability limit, and are optimal in that sense.
△ Less
Submitted 29 July, 2019; v1 submitted 26 December, 2017;
originally announced December 2017.
-
Minimal Exploration in Structured Stochastic Bandits
Authors:
Richard Combes,
Stefan Magureanu,
Alexandre Proutiere
Abstract:
This paper introduces and addresses a wide class of stochastic bandit problems where the function map** the arm to the corresponding reward exhibits some known structural properties. Most existing structures (e.g. linear, Lipschitz, unimodal, combinatorial, dueling, ...) are covered by our framework. We derive an asymptotic instance-specific regret lower bound for these problems, and develop OSS…
▽ More
This paper introduces and addresses a wide class of stochastic bandit problems where the function map** the arm to the corresponding reward exhibits some known structural properties. Most existing structures (e.g. linear, Lipschitz, unimodal, combinatorial, dueling, ...) are covered by our framework. We derive an asymptotic instance-specific regret lower bound for these problems, and develop OSSB, an algorithm whose regret matches this fundamental limit. OSSB is not based on the classical principle of "optimism in the face of uncertainty" or on Thompson sampling, and rather aims at matching the minimal exploration rates of sub-optimal arms as characterized in the derivation of the regret lower bound. We illustrate the efficiency of OSSB using numerical experiments in the case of the linear bandit problem and show that OSSB outperforms existing algorithms, including Thompson sampling.
△ Less
Submitted 1 November, 2017;
originally announced November 2017.
-
Optimal Cluster Recovery in the Labeled Stochastic Block Model
Authors:
Se-Young Yun,
Alexandre Proutiere
Abstract:
We consider the problem of community detection or clustering in the labeled Stochastic Block Model (LSBM) with a finite number $K$ of clusters of sizes linearly growing with the global population of items $n$. Every pair of items is labeled independently at random, and label $\ell$ appears with probability $p(i,j,\ell)$ between two items in clusters indexed by $i$ and $j$, respectively. The object…
▽ More
We consider the problem of community detection or clustering in the labeled Stochastic Block Model (LSBM) with a finite number $K$ of clusters of sizes linearly growing with the global population of items $n$. Every pair of items is labeled independently at random, and label $\ell$ appears with probability $p(i,j,\ell)$ between two items in clusters indexed by $i$ and $j$, respectively. The objective is to reconstruct the clusters from the observation of these random labels.
Clustering under the SBM and their extensions has attracted much attention recently. Most existing work aimed at characterizing the set of parameters such that it is possible to infer clusters either positively correlated with the true clusters, or with a vanishing proportion of misclassified items, or exactly matching the true clusters. We find the set of parameters such that there exists a clustering algorithm with at most $s$ misclassified items in average under the general LSBM and for any $s=o(n)$, which solves one open problem raised in \cite{abbe2015community}. We further develop an algorithm, based on simple spectral methods, that achieves this fundamental performance limit within $O(n \mbox{polylog}(n))$ computations and without the a-priori knowledge of the model parameters.
△ Less
Submitted 21 May, 2016; v1 submitted 20 October, 2015;
originally announced October 2015.
-
Streaming, Memory Limited Matrix Completion with Noise
Authors:
Se-Young Yun,
Marc Lelarge,
Alexandre Proutiere
Abstract:
In this paper, we consider the streaming memory-limited matrix completion problem when the observed entries are noisy versions of a small random fraction of the original entries. We are interested in scenarios where the matrix size is very large so the matrix is very hard to store and manipulate. Here, columns of the observed matrix are presented sequentially and the goal is to complete the missin…
▽ More
In this paper, we consider the streaming memory-limited matrix completion problem when the observed entries are noisy versions of a small random fraction of the original entries. We are interested in scenarios where the matrix size is very large so the matrix is very hard to store and manipulate. Here, columns of the observed matrix are presented sequentially and the goal is to complete the missing entries after one pass on the data with limited memory space and limited computational complexity. We propose a streaming algorithm which produces an estimate of the original matrix with a vanishing mean square error, uses memory space scaling linearly with the ambient dimension of the matrix, i.e. the memory required to store the output alone, and spends computations as much as the number of non-zero entries of the input matrix.
△ Less
Submitted 13 April, 2015;
originally announced April 2015.
-
Combinatorial Bandits Revisited
Authors:
Richard Combes,
M. Sadegh Talebi,
Alexandre Proutiere,
Marc Lelarge
Abstract:
This paper investigates stochastic and adversarial combinatorial multi-armed bandit problems. In the stochastic setting under semi-bandit feedback, we derive a problem-specific regret lower bound, and discuss its scaling with the dimension of the decision space. We propose ESCB, an algorithm that efficiently exploits the structure of the problem and provide a finite-time analysis of its regret. ES…
▽ More
This paper investigates stochastic and adversarial combinatorial multi-armed bandit problems. In the stochastic setting under semi-bandit feedback, we derive a problem-specific regret lower bound, and discuss its scaling with the dimension of the decision space. We propose ESCB, an algorithm that efficiently exploits the structure of the problem and provide a finite-time analysis of its regret. ESCB has better performance guarantees than existing algorithms, and significantly outperforms these algorithms in practice. In the adversarial setting under bandit feedback, we propose \textsc{CombEXP}, an algorithm with the same regret scaling as state-of-the-art algorithms, but with lower computational complexity for some combinatorial problems.
△ Less
Submitted 5 November, 2015; v1 submitted 11 February, 2015;
originally announced February 2015.
-
Stochastic Online Shortest Path Routing: The Value of Feedback
Authors:
M. Sadegh Talebi,
Zhenhua Zou,
Richard Combes,
Alexandre Proutiere,
Mikael Johansson
Abstract:
This paper studies online shortest path routing over multi-hop networks. Link costs or delays are time-varying and modeled by independent and identically distributed random processes, whose parameters are initially unknown. The parameters, and hence the optimal path, can only be estimated by routing packets through the network and observing the realized delays. Our aim is to find a routing policy…
▽ More
This paper studies online shortest path routing over multi-hop networks. Link costs or delays are time-varying and modeled by independent and identically distributed random processes, whose parameters are initially unknown. The parameters, and hence the optimal path, can only be estimated by routing packets through the network and observing the realized delays. Our aim is to find a routing policy that minimizes the regret (the cumulative difference of expected delay) between the path chosen by the policy and the unknown optimal path. We formulate the problem as a combinatorial bandit optimization problem and consider several scenarios that differ in where routing decisions are made and in the information available when making the decisions. For each scenario, we derive a tight asymptotic lower bound on the regret that has to be satisfied by any online routing policy. These bounds help us to understand the performance improvements we can expect when (i) taking routing decisions at each hop rather than at the source only, and (ii) observing per-link delays rather than end-to-end path delays. In particular, we show that (i) is of no use while (ii) can have a spectacular impact. Three algorithms, with a trade-off between computational complexity and performance, are proposed. The regret upper bounds of these algorithms improve over those of the existing algorithms, and they significantly outperform state-of-the-art algorithms in numerical experiments.
△ Less
Submitted 18 January, 2017; v1 submitted 27 September, 2013;
originally announced September 2013.
-
Randomized Consensus with Attractive and Repulsive Links
Authors:
Guodong Shi,
Alexandre Proutiere,
Mikael Johansson,
Karl H. Johansson
Abstract:
We study convergence properties of a randomized consensus algorithm over a graph with both attractive and repulsive links. At each time instant, a node is randomly selected to interact with a random neighbor. Depending on if the link between the two nodes belongs to a given subgraph of attractive or repulsive links, the node update follows a standard attractive weighted average or a repulsive weig…
▽ More
We study convergence properties of a randomized consensus algorithm over a graph with both attractive and repulsive links. At each time instant, a node is randomly selected to interact with a random neighbor. Depending on if the link between the two nodes belongs to a given subgraph of attractive or repulsive links, the node update follows a standard attractive weighted average or a repulsive weighted average, respectively. The repulsive update has the opposite sign of the standard consensus update. In this way, it counteracts the consensus formation and can be seen as a model of link faults or malicious attacks in a communication network, or the impact of trust and antagonism in a social network. Various probabilistic convergence and divergence conditions are established. A threshold condition for the strength of the repulsive action is given for convergence in expectation: when the repulsive weight crosses this threshold value, the algorithm transits from convergence to divergence. An explicit value of the threshold is derived for classes of attractive and repulsive graphs. The results show that a single repulsive link can sometimes drastically change the behavior of the consensus algorithm. They also explicitly show how the robustness of the consensus algorithm depends on the size and other properties of the graphs.
△ Less
Submitted 9 September, 2013;
originally announced September 2013.
-
Spectrum Bandit Optimization
Authors:
Marc Lelarge,
Alexandre Proutiere,
M. Sadegh Talebi
Abstract:
We consider the problem of allocating radio channels to links in a wireless network. Links interact through interference, modelled as a conflict graph (i.e., two interfering links cannot be simultaneously active on the same channel). We aim at identifying the channel allocation maximizing the total network throughput over a finite time horizon. Should we know the average radio conditions on each c…
▽ More
We consider the problem of allocating radio channels to links in a wireless network. Links interact through interference, modelled as a conflict graph (i.e., two interfering links cannot be simultaneously active on the same channel). We aim at identifying the channel allocation maximizing the total network throughput over a finite time horizon. Should we know the average radio conditions on each channel and on each link, an optimal allocation would be obtained by solving an Integer Linear Program (ILP). When radio conditions are unknown a priori, we look for a sequential channel allocation policy that converges to the optimal allocation while minimizing on the way the throughput loss or {\it regret} due to the need for exploring sub-optimal allocations. We formulate this problem as a generic linear bandit problem, and analyze it first in a stochastic setting where radio conditions are driven by a stationary stochastic process, and then in an adversarial setting where radio conditions can evolve arbitrarily. We provide new algorithms in both settings and derive upper bounds on their regrets.
△ Less
Submitted 17 February, 2015; v1 submitted 27 February, 2013;
originally announced February 2013.
-
Distributed Optimization: Convergence Conditions from a Dynamical System Perspective
Authors:
Guodong Shi,
Alexandre Proutiere,
Karl Henrik Johansson
Abstract:
This paper explores the fundamental properties of distributed minimization of a sum of functions with each function only known to one node, and a pre-specified level of node knowledge and computational capacity. We define the optimization information each node receives from its objective function, the neighboring information each node receives from its neighbors, and the computational capacity eac…
▽ More
This paper explores the fundamental properties of distributed minimization of a sum of functions with each function only known to one node, and a pre-specified level of node knowledge and computational capacity. We define the optimization information each node receives from its objective function, the neighboring information each node receives from its neighbors, and the computational capacity each node can take advantage of in controlling its state. It is proven that there exist a neighboring information way and a control law that guarantee global optimal consensus if and only if the solution sets of the local objective functions admit a nonempty intersection set for fixed strongly connected graphs. Then we show that for any tolerated error, we can find a control law that guarantees global optimal consensus within this error for fixed, bidirectional, and connected graphs under mild conditions. For time-varying graphs, we show that optimal consensus can always be achieved as long as the graph is uniformly jointly strongly connected and the nonempty intersection condition holds. The results illustrate that nonempty intersection for the local optimal solution sets is a critical condition for successful distributed optimization for a large class of algorithms.
△ Less
Submitted 24 October, 2012;
originally announced October 2012.
-
A particle system in interaction with a rapidly varying environment: Mean field limits and applications
Authors:
Charles Bordenave,
David McDonald,
Alexandre Proutiere
Abstract:
We study an interacting particle system whose dynamics depends on an interacting random environment. As the number of particles grows large, the transition rate of the particles slows down (perhaps because they share a common resource of fixed capacity). The transition rate of a particle is determined by its state, by the empirical distribution of all the particles and by a rapidly varying envir…
▽ More
We study an interacting particle system whose dynamics depends on an interacting random environment. As the number of particles grows large, the transition rate of the particles slows down (perhaps because they share a common resource of fixed capacity). The transition rate of a particle is determined by its state, by the empirical distribution of all the particles and by a rapidly varying environment. The transitions of the environment are determined by the empirical distribution of the particles. We prove the propagation of chaos on the path space of the particles and establish that the limiting trajectory of the empirical measure of the states of the particles satisfies a deterministic differential equation. This deterministic differential equation involves the time averages of the environment process.
We apply our results to analyze the performance of communication networks where users access some resources using random distributed multi-access algorithms. For these networks, we show that the environment process corresponds to a process describing the number of clients in a certain loss network, which allows us provide simple and explicit expressions of the network performance.
△ Less
Submitted 16 February, 2009; v1 submitted 12 January, 2007;
originally announced January 2007.