-
A Tale of Two-Timescale Reinforcement Learning with the Tightest Finite-Time Bound
Authors:
Gal Dalal,
Balazs Szorenyi,
Gugan Thoppe
Abstract:
Policy evaluation in reinforcement learning is often conducted using two-timescale stochastic approximation, which results in various gradient temporal difference methods such as GTD(0), GTD2, and TDC. Here, we provide convergence rate bounds for this suite of algorithms. Algorithms such as these have two iterates, $θ_n$ and $w_n,$ which are updated using two distinct stepsize sequences, $α_n$ and $β_n,$ respectively. Assuming $α_n = n^{-α}$ and $β_n = n^{-β}$ with $1 > α> β> 0,$ we show that, with high probability, the two iterates converge to their respective solutions $θ^*$ and $w^*$ at rates given by $\|θ_n - θ^*\| = \tilde{O}( n^{-α/2})$ and $\|w_n - w^*\| = \tilde{O}(n^{-β/2});$ here, $\tilde{O}$ hides logarithmic terms. Via comparable lower bounds, we show that these bounds are, in fact, tight. To the best of our knowledge, ours is the first finite-time analysis which achieves these rates. While it was known that the two timescale components decouple asymptotically, our results depict this phenomenon more explicitly by showing that it in fact happens from some finite time onwards. Lastly, compared to existing works, our result applies to a broader family of stepsizes, including non-square summable ones.
Submitted 4 December, 2019; v1 submitted 20 November, 2019;
originally announced November 2019.
-
Optimal Learning of Mallows Block Model
Authors:
Róbert Busa-Fekete,
Dimitris Fotakis,
Balázs Szörényi,
Manolis Zampetakis
Abstract:
The Mallows model, introduced in the seminal paper of Mallows 1957, is one of the most fundamental ranking distributions over the symmetric group $S_m$. To analyze more complex ranking data, several studies considered the Generalized Mallows model defined by Fligner and Verducci 1986. Despite the significant research interest in ranking distributions, the exact sample complexity of estimating the parameters of a Mallows and a Generalized Mallows Model is not well understood. The main result of the paper is a tight sample complexity bound for learning the Mallows and Generalized Mallows Models. We approach the learning problem by analyzing a more general model which interpolates between the single-parameter Mallows Model and the $m$-parameter Mallows model. We call our model the Mallows Block Model, referring to the Block Models that are popular in theoretical statistics. Our sample complexity analysis gives a tight bound for learning the Mallows Block Model for any number of blocks. We provide essentially matching lower bounds for our sample complexity results. As a corollary of our analysis, it turns out that, if the central ranking is known, one single sample from the Mallows Block Model is sufficient to estimate the spread parameters with error that goes to zero as the size of the permutations goes to infinity. In addition, we calculate the exact rate of the parameter estimation error.
Submitted 3 June, 2019;
originally announced June 2019.
-
Learning to Crawl
Authors:
Utkarsh Upadhyay,
Robert Busa-Fekete,
Wojciech Kotlowski,
David Pal,
Balazs Szorenyi
Abstract:
Web crawling is the problem of keeping a cache of webpages fresh, i.e., having the most recent copy available when a page is requested. This problem is usually coupled with the natural restriction that the bandwidth available to the web crawler is limited. The corresponding optimization problem was solved optimally by Azar et al. [2018] under the assumption that, for each webpage, both the elapsed time between two changes and the elapsed time between two requests follow a Poisson distribution with known parameters. In this paper, we study the same control problem but under the assumption that the change rates are unknown a priori, and thus we need to estimate them in an online fashion using only partial observations (i.e., single-bit signals indicating whether the page has changed since the last refresh). As a point of departure, we characterise the conditions under which one can solve the problem with such partial observability. Next, we propose a practical estimator and compute confidence intervals for it in terms of the elapsed time between the observations. Finally, we show that the explore-and-commit algorithm achieves an $\mathcal{O}(\sqrt{T})$ regret with a carefully chosen exploration horizon. Our simulation study shows that our online policy scales well and achieves close to optimal performance for a wide range of the parameters.
Submitted 22 November, 2019; v1 submitted 29 May, 2019;
originally announced May 2019.
-
Bandit Multiclass Linear Classification: Efficient Algorithms for the Separable Case
Authors:
Alina Beygelzimer,
Dávid Pál,
Balázs Szörényi,
Devanathan Thiruvenkatachari,
Chen-Yu Wei,
Chicheng Zhang
Abstract:
We study the problem of efficient online multiclass linear classification with bandit feedback, where all examples belong to one of $K$ classes and lie in the $d$-dimensional Euclidean space. Previous works have left open the challenge of designing efficient algorithms with finite mistake bounds when the data is linearly separable by a margin $γ$. In this work, we take a first step towards this problem. We consider two notions of linear separability: strong and weak.
1. Under the strong linear separability condition, we design an efficient algorithm that achieves a near-optimal mistake bound of $O\left( K/γ^2 \right)$.
2. Under the more challenging weak linear separability condition, we design an efficient algorithm with a mistake bound of $\min (2^{\widetilde{O}(K \log^2 (1/γ))}, 2^{\widetilde{O}(\sqrt{1/γ} \log K)})$. Our algorithm is based on kernel Perceptron, which is inspired by the work of (Klivans and Servedio, 2008) on improperly learning intersection of halfspaces.
Submitted 18 June, 2019; v1 submitted 6 February, 2019;
originally announced February 2019.
-
The information-theoretic value of unlabeled data in semi-supervised learning
Authors:
Alexander Golovnev,
Dávid Pál,
Balázs Szörényi
Abstract:
We quantify the separation between the numbers of labeled examples required to learn in two settings: with and without knowledge of the distribution of the unlabeled data. More specifically, we prove a separation by a $Θ(\log n)$ multiplicative factor for the class of projections over the Boolean hypercube of dimension $n$. We prove that there is no separation for the class of all functions on a domain of any size.
Learning with the knowledge of the distribution (a.k.a. fixed-distribution learning) can be viewed as an idealized scenario of semi-supervised learning where the number of unlabeled data points is so great that the unlabeled distribution is known exactly. For this reason, we call the separation the value of unlabeled data.
Submitted 13 May, 2019; v1 submitted 16 January, 2019;
originally announced January 2019.
-
Multi-objective Bandits: Optimizing the Generalized Gini Index
Authors:
Robert Busa-Fekete,
Balazs Szorenyi,
Paul Weng,
Shie Mannor
Abstract:
We study the multi-armed bandit (MAB) problem where the agent receives vectorial feedback that encodes many possibly competing objectives to be optimized. The goal of the agent is to find a policy that can optimize these objectives simultaneously in a fair way. This multi-objective online optimization problem is formalized by using the Generalized Gini Index (GGI) aggregation function. We propose an online gradient descent algorithm which exploits the convexity of the GGI aggregation function and controls the exploration in a careful way, achieving a distribution-free regret $\tilde{O}(T^{-1/2})$ with high probability. We test our algorithm on synthetic data as well as on an electric battery control problem where the goal is to trade off the use of the different cells of a battery in order to balance their respective degradation rates.
Submitted 15 June, 2017;
originally announced June 2017.
-
Finite Sample Analyses for TD(0) with Function Approximation
Authors:
Gal Dalal,
Balázs Szörényi,
Gugan Thoppe,
Shie Mannor
Abstract:
TD(0) is one of the most commonly used algorithms in reinforcement learning. Despite this, there is no existing finite sample analysis for TD(0) with function approximation, even for the linear case. Our work is the first to provide such results. Existing convergence rates for Temporal Difference (TD) methods apply only to somewhat modified versions, e.g., projected variants or ones where stepsizes depend on unknown problem parameters. Our analyses obviate these artificial alterations by exploiting strong properties of TD(0). We provide convergence rates both in expectation and with high probability. The two are obtained via different approaches that use relatively unknown, recently developed stochastic approximation techniques.
Submitted 11 December, 2017; v1 submitted 4 April, 2017;
originally announced April 2017.
-
Finite Sample Analysis of Two-Timescale Stochastic Approximation with Applications to Reinforcement Learning
Authors:
Gal Dalal,
Balazs Szorenyi,
Gugan Thoppe,
Shie Mannor
Abstract:
Two-timescale Stochastic Approximation (SA) algorithms are widely used in Reinforcement Learning (RL). Their iterates have two parts that are updated using distinct stepsizes. In this work, we develop a novel recipe for their finite sample analysis. Using this, we provide a concentration bound, which is the first such result for a two-timescale SA. The type of bound we obtain is known as `lock-in probability'. We also introduce a new projection scheme, in which the time between successive projections increases exponentially. This scheme allows one to elegantly transform a lock-in probability into a convergence rate result for projected two-timescale SA. From this latter result, we then extract key insights on stepsize selection. As an application, we finally obtain convergence rates for the projected two-timescale RL algorithms GTD(0), GTD2, and TDC.
Submitted 4 June, 2018; v1 submitted 15 March, 2017;
originally announced March 2017.
-
Distributed Clustering of Linear Bandits in Peer to Peer Networks
Authors:
Nathan Korda,
Balazs Szorenyi,
Shuai Li
Abstract:
We provide two distributed confidence ball algorithms for solving linear bandit problems in peer to peer networks with limited communication capabilities. For the first, we assume that all the peers are solving the same linear bandit problem, and prove that our algorithm achieves the optimal asymptotic regret rate of any centralised algorithm that can instantly communicate information between the peers. For the second, we assume that there are clusters of peers solving the same bandit problem within each cluster, and we prove that our algorithm discovers these clusters, while achieving the optimal asymptotic regret rate within each one. Through experiments on several real-world datasets, we demonstrate the performance of the proposed algorithms compared to the state-of-the-art.
Submitted 7 June, 2016; v1 submitted 26 April, 2016;
originally announced April 2016.
-
Biclique coverings, rectifier networks and the cost of $\varepsilon$-removal
Authors:
Szabolcs Iván,
Ádám Dániel Lelkes,
Judit Nagy-György,
Balázs Szörényi,
György Turán
Abstract:
We relate two complexity notions of bipartite graphs: the minimal weight biclique covering number $\mathrm{Cov}(G)$ and the minimal rectifier network size $\mathrm{Rect}(G)$ of a bipartite graph $G$. We show that there exist graphs with $\mathrm{Cov}(G)\geq \mathrm{Rect}(G)^{3/2-ε}$. As a corollary, we establish that there exist nondeterministic finite automata (NFAs) with $\varepsilon$-transitions, having $n$ transitions total such that the smallest equivalent $\varepsilon$-free NFA has $Ω(n^{3/2-ε})$ transitions. We also formulate a version of previous bounds for the weighted set cover problem and discuss its connections to giving upper bounds for the possible blow-up.
Submitted 30 May, 2014;
originally announced June 2014.