Search | arXiv e-print repository

Optimal Multi-Fidelity Best-Arm Identification

Authors: Riccardo Poiani, Rémy Degenne, Emilie Kaufmann, Alberto Maria Metelli, Marcello Restelli

Abstract: In bandit best-arm identification, an algorithm is tasked with finding the arm with highest mean reward with a specified accuracy as fast as possible. We study multi-fidelity best-arm identification, in which the algorithm can choose to sample an arm at a lower fidelity (less accurate mean estimate) for a lower cost. Several methods have been proposed for tackling this problem, but their optimalit… ▽ More In bandit best-arm identification, an algorithm is tasked with finding the arm with highest mean reward with a specified accuracy as fast as possible. We study multi-fidelity best-arm identification, in which the algorithm can choose to sample an arm at a lower fidelity (less accurate mean estimate) for a lower cost. Several methods have been proposed for tackling this problem, but their optimality remain elusive, notably due to loose lower bounds on the total cost needed to identify the best arm. Our first contribution is a tight, instance-dependent lower bound on the cost complexity. The study of the optimization problem featured in the lower bound provides new insights to devise computationally efficient algorithms, and leads us to propose a gradient-based approach with asymptotically optimal cost complexity. We demonstrate the benefits of the new algorithm compared to existing methods in experiments. Our theoretical and empirical findings also shed light on an intriguing concept of optimal fidelity for each arm. △ Less

Submitted 5 June, 2024; originally announced June 2024.

arXiv:2405.17108 [pdf, ps, other]

Finding good policies in average-reward Markov Decision Processes without prior knowledge

Authors: Adrienne Tuynman, Rémy Degenne, Emilie Kaufmann

Abstract: We revisit the identification of an $\varepsilon$-optimal policy in average-reward Markov Decision Processes (MDP). In such MDPs, two measures of complexity have appeared in the literature: the diameter, $D$, and the optimal bias span, $H$, which satisfy $H\leq D$. Prior work have studied the complexity of $\varepsilon$-optimal policy identification only when a generative model is available. In th… ▽ More We revisit the identification of an $\varepsilon$-optimal policy in average-reward Markov Decision Processes (MDP). In such MDPs, two measures of complexity have appeared in the literature: the diameter, $D$, and the optimal bias span, $H$, which satisfy $H\leq D$. Prior work have studied the complexity of $\varepsilon$-optimal policy identification only when a generative model is available. In this case, it is known that there exists an MDP with $D \simeq H$ for which the sample complexity to output an $\varepsilon$-optimal policy is $Ω(SAD/\varepsilon^2)$ where $S$ and $A$ are the sizes of the state and action spaces. Recently, an algorithm with a sample complexity of order $SAH/\varepsilon^2$ has been proposed, but it requires the knowledge of $H$. We first show that the sample complexity required to estimate $H$ is not bounded by any function of $S,A$ and $H$, ruling out the possibility to easily make the previous algorithm agnostic to $H$. By relying instead on a diameter estimation procedure, we propose the first algorithm for $(\varepsilon,δ)$-PAC policy identification that does not need any form of prior knowledge on the MDP. Its sample complexity scales in $SAD/\varepsilon^2$ in the regime of small $\varepsilon$, which is near-optimal. In the online setting, our first contribution is a lower bound which implies that a sample complexity polynomial in $H$ cannot be achieved in this setting. Then, we propose an online algorithm with a sample complexity in $SAD^2/\varepsilon^2$, as well as a novel approach based on a data-dependent stop** rule that we believe is promising to further reduce this bound. △ Less

Submitted 27 May, 2024; originally announced May 2024.

arXiv:2403.01892 [pdf, other]

Information Lower Bounds for Robust Mean Estimation

Authors: Rémy Degenne, Timothée Mathieu

Abstract: We prove lower bounds on the error of any estimator for the mean of a real probability distribution under the knowledge that the distribution belongs to a given set. We apply these lower bounds both to parametric and nonparametric estimation. In the nonparametric case, we apply our results to the question of sub-Gaussian estimation for distributions with finite variance to obtain new lower bounds… ▽ More We prove lower bounds on the error of any estimator for the mean of a real probability distribution under the knowledge that the distribution belongs to a given set. We apply these lower bounds both to parametric and nonparametric estimation. In the nonparametric case, we apply our results to the question of sub-Gaussian estimation for distributions with finite variance to obtain new lower bounds in the small error probability regime, and present an optimal estimator in that regime. In the (semi-)parametric case, we use the Fisher information to provide distribution-dependent lower bounds that are constant-tight asymptotically, of order $\sqrt{2\log(1/δ)/(nI)}$ where $I$ is the Fisher information of the distribution. We use known minimizers of the Fisher information on some nonparametric set of distributions to give lower bounds in cases such as corrupted distributions, or bounded/semi-bounded distributions. △ Less

Submitted 4 March, 2024; originally announced March 2024.

arXiv:2305.16041 [pdf, other]

An $\varepsilon$-Best-Arm Identification Algorithm for Fixed-Confidence and Beyond

Authors: Marc Jourdan, Rémy Degenne, Emilie Kaufmann

Abstract: We propose EB-TC$\varepsilon$, a novel sampling rule for $\varepsilon$-best arm identification in stochastic bandits. It is the first instance of Top Two algorithm analyzed for approximate best arm identification. EB-TC$\varepsilon$ is an *anytime* sampling rule that can therefore be employed without modification for fixed confidence or fixed budget identification (without prior knowledge of the b… ▽ More We propose EB-TC$\varepsilon$, a novel sampling rule for $\varepsilon$-best arm identification in stochastic bandits. It is the first instance of Top Two algorithm analyzed for approximate best arm identification. EB-TC$\varepsilon$ is an *anytime* sampling rule that can therefore be employed without modification for fixed confidence or fixed budget identification (without prior knowledge of the budget). We provide three types of theoretical guarantees for EB-TC$\varepsilon$. First, we prove bounds on its expected sample complexity in the fixed confidence setting, notably showing its asymptotic optimality in combination with an adaptive tuning of its exploration parameter. We complement these findings with upper bounds on its probability of error at any time and for any error parameter, which further yield upper bounds on its simple regret at any time. Finally, we show through numerical simulations that EB-TC$\varepsilon$ performs favorably compared to existing algorithms, in different settings. △ Less

Submitted 6 November, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

Comments: 68 pages, 14 figures, 4 tables. To be published in the Thirty-seventh Conference on Neural Information Processing Systems

arXiv:2303.09468 [pdf, ps, other]

On the Existence of a Complexity in Fixed Budget Bandit Identification

Authors: Rémy Degenne

Abstract: In fixed budget bandit identification, an algorithm sequentially observes samples from several distributions up to a given final time. It then answers a query about the set of distributions. A good algorithm will have a small probability of error. While that probability decreases exponentially with the final time, the best attainable rate is not known precisely for most identification tasks. We sh… ▽ More In fixed budget bandit identification, an algorithm sequentially observes samples from several distributions up to a given final time. It then answers a query about the set of distributions. A good algorithm will have a small probability of error. While that probability decreases exponentially with the final time, the best attainable rate is not known precisely for most identification tasks. We show that if a fixed budget task admits a complexity, defined as a lower bound on the probability of error which is attained by the same algorithm on all bandit problems, then that complexity is determined by the best non-adaptive sampling procedure for that problem. We show that there is no such complexity for several fixed budget identification tasks including Bernoulli best arm identification with two arms: there is no single algorithm that attains everywhere the best possible rate. △ Less

Submitted 30 June, 2023; v1 submitted 16 March, 2023; originally announced March 2023.

Comments: 24 pages, 36th Annual Conference on Learning Theory, Proceedings of Machine Learning Research vol 195

arXiv:2212.05578 [pdf, other]

A Formalization of Doob's Martingale Convergence Theorems in mathlib

Authors: Kexing Ying, Rémy Degenne

Abstract: We present the formalization of Doob's martingale convergence theorems in the mathlib library for the Lean theorem prover. These theorems give conditions under which (sub)martingales converge, almost everywhere or in $L^1$. In order to formalize those results, we build a definition of the conditional expectation in Banach spaces and develop the theory of stochastic processes, stop** times and ma… ▽ More We present the formalization of Doob's martingale convergence theorems in the mathlib library for the Lean theorem prover. These theorems give conditions under which (sub)martingales converge, almost everywhere or in $L^1$. In order to formalize those results, we build a definition of the conditional expectation in Banach spaces and develop the theory of stochastic processes, stop** times and martingales. As an application of the convergence theorems, we also present the formalization of Lévy's generalized Borel-Cantelli lemma. This work on martingale theory is one of the first developments of probability theory in mathlib, and it builds upon diverse parts of that library such as topology, analysis and most importantly measure theory. △ Less

Submitted 11 December, 2022; originally announced December 2022.

arXiv:2210.05431 [pdf, other]

Non-Asymptotic Analysis of a UCB-based Top Two Algorithm

Authors: Marc Jourdan, Rémy Degenne

Abstract: A Top Two sampling rule for bandit identification is a method which selects the next arm to sample from among two candidate arms, a leader and a challenger. Due to their simplicity and good empirical performance, they have received increased attention in recent years. However, for fixed-confidence best arm identification, theoretical guarantees for Top Two methods have only been obtained in the as… ▽ More A Top Two sampling rule for bandit identification is a method which selects the next arm to sample from among two candidate arms, a leader and a challenger. Due to their simplicity and good empirical performance, they have received increased attention in recent years. However, for fixed-confidence best arm identification, theoretical guarantees for Top Two methods have only been obtained in the asymptotic regime, when the error level vanishes. In this paper, we derive the first non-asymptotic upper bound on the expected sample complexity of a Top Two algorithm, which holds for any error level. Our analysis highlights sufficient properties for a regret minimization algorithm to be used as leader. These properties are satisfied by the UCB algorithm, and our proposed UCB-based Top Two algorithm simultaneously enjoys non-asymptotic guarantees and competitive empirical performance. △ Less

Submitted 6 November, 2023; v1 submitted 11 October, 2022; originally announced October 2022.

Comments: 32 pages, 5 figures, 3 tables. To be published in the Thirty-seventh Conference on Neural Information Processing Systems

arXiv:2210.00974 [pdf, other]

Dealing with Unknown Variances in Best-Arm Identification

Authors: Marc Jourdan, Rémy Degenne, Emilie Kaufmann

Abstract: The problem of identifying the best arm among a collection of items having Gaussian rewards distribution is well understood when the variances are known. Despite its practical relevance for many applications, few works studied it for unknown variances. In this paper we introduce and analyze two approaches to deal with unknown variances, either by plugging in the empirical variance or by adapting t… ▽ More The problem of identifying the best arm among a collection of items having Gaussian rewards distribution is well understood when the variances are known. Despite its practical relevance for many applications, few works studied it for unknown variances. In this paper we introduce and analyze two approaches to deal with unknown variances, either by plugging in the empirical variance or by adapting the transportation costs. In order to calibrate our two stop** rules, we derive new time-uniform concentration inequalities, which are of independent interest. Then, we illustrate the theoretical and empirical performances of our two sampling rule wrappers on Track-and-Stop and on a Top Two algorithm. Moreover, by quantifying the impact on the sample complexity of not knowing the variances, we reveal that it is rather small. △ Less

Submitted 23 January, 2023; v1 submitted 3 October, 2022; originally announced October 2022.

Comments: 73 pages, 5 figures, 3 tables. To be published in the 34th International Conference on Algorithmic Learning Theory, Singapore, 2023

arXiv:2206.05979 [pdf, other]

Top Two Algorithms Revisited

Authors: Marc Jourdan, Rémy Degenne, Dorian Baudry, Rianne de Heide, Emilie Kaufmann

Abstract: Top Two algorithms arose as an adaptation of Thompson sampling to best arm identification in multi-armed bandit models (Russo, 2016), for parametric families of arms. They select the next arm to sample from by randomizing among two candidate arms, a leader and a challenger. Despite their good empirical performance, theoretical guarantees for fixed-confidence best arm identification have only been… ▽ More Top Two algorithms arose as an adaptation of Thompson sampling to best arm identification in multi-armed bandit models (Russo, 2016), for parametric families of arms. They select the next arm to sample from by randomizing among two candidate arms, a leader and a challenger. Despite their good empirical performance, theoretical guarantees for fixed-confidence best arm identification have only been obtained when the arms are Gaussian with known variances. In this paper, we provide a general analysis of Top Two methods, which identifies desirable properties of the leader, the challenger, and the (possibly non-parametric) distributions of the arms. As a result, we obtain theoretically supported Top Two algorithms for best arm identification with bounded distributions. Our proof method demonstrates in particular that the sampling step used to select the leader inherited from Thompson sampling can be replaced by other choices, like selecting the empirical best arm. △ Less

Submitted 4 October, 2022; v1 submitted 13 June, 2022; originally announced June 2022.

Comments: 75 pages, 8 figures, 3 tables

arXiv:2206.04456 [pdf, other]

Choosing Answers in $\varepsilon$-Best-Answer Identification for Linear Bandits

Authors: Marc Jourdan, Rémy Degenne

Abstract: In pure-exploration problems, information is gathered sequentially to answer a question on the stochastic environment. While best-arm identification for linear bandits has been extensively studied in recent years, few works have been dedicated to identifying one arm that is $\varepsilon$-close to the best one (and not exactly the best one). In this problem with several correct answers, an identifi… ▽ More In pure-exploration problems, information is gathered sequentially to answer a question on the stochastic environment. While best-arm identification for linear bandits has been extensively studied in recent years, few works have been dedicated to identifying one arm that is $\varepsilon$-close to the best one (and not exactly the best one). In this problem with several correct answers, an identification algorithm should focus on one candidate among those answers and verify that it is correct. We demonstrate that picking the answer with highest mean does not allow an algorithm to reach asymptotic optimality in terms of expected sample complexity. Instead, a \textit{furthest answer} should be identified. Using that insight to choose the candidate answer carefully, we develop a simple procedure to adapt best-arm identification algorithms to tackle $\varepsilon$-best-answer identification in transductive linear stochastic bandits. Finally, we propose an asymptotically optimal algorithm for this setting, which is shown to achieve competitive empirical performance against existing modified best-arm identification algorithms. △ Less

Submitted 9 June, 2022; originally announced June 2022.

Comments: 47 pages, 10 figures, 8 tables. To be published in the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022

arXiv:2205.10936 [pdf, other]

On Elimination Strategies for Bandit Fixed-Confidence Identification

Authors: Andrea Tirinzoni, Rémy Degenne

Abstract: Elimination algorithms for bandit identification, which prune the plausible correct answers sequentially until only one remains, are computationally convenient since they reduce the problem size over time. However, existing elimination strategies are often not fully adaptive (they update their sampling rule infrequently) and are not easy to extend to combinatorial settings, where the set of answer… ▽ More Elimination algorithms for bandit identification, which prune the plausible correct answers sequentially until only one remains, are computationally convenient since they reduce the problem size over time. However, existing elimination strategies are often not fully adaptive (they update their sampling rule infrequently) and are not easy to extend to combinatorial settings, where the set of answers is exponentially large in the problem dimension. On the other hand, most existing fully-adaptive strategies to tackle general identification problems are computationally demanding since they repeatedly test the correctness of every answer, without ever reducing the problem size. We show that adaptive methods can be modified to use elimination in both their stop** and sampling rules, hence obtaining the best of these two worlds: the algorithms (1) remain fully adaptive, (2) suffer a sample complexity that is never worse of their non-elimination counterpart, and (3) provably eliminate certain wrong answers early. We confirm these benefits experimentally, where elimination improves significantly the computational complexity of adaptive methods on common tasks like best-arm identification in linear bandits. △ Less

Submitted 24 October, 2022; v1 submitted 22 May, 2022; originally announced May 2022.

arXiv:2111.01479 [pdf, other]

Dealing With Misspecification In Fixed-Confidence Linear Top-m Identification

Authors: Clémence Réda, Andrea Tirinzoni, Rémy Degenne

Abstract: We study the problem of the identification of m arms with largest means under a fixed error rate $δ$ (fixed-confidence Top-m identification), for misspecified linear bandit models. This problem is motivated by practical applications, especially in medicine and recommendation systems, where linear models are popular due to their simplicity and the existence of efficient algorithms, but in which dat… ▽ More We study the problem of the identification of m arms with largest means under a fixed error rate $δ$ (fixed-confidence Top-m identification), for misspecified linear bandit models. This problem is motivated by practical applications, especially in medicine and recommendation systems, where linear models are popular due to their simplicity and the existence of efficient algorithms, but in which data inevitably deviates from linearity. In this work, we first derive a tractable lower bound on the sample complexity of any $δ$-correct algorithm for the general Top-m identification problem. We show that knowing the scale of the deviation from linearity is necessary to exploit the structure of the problem. We then describe the first algorithm for this setting, which is both practical and adapts to the amount of misspecification. We derive an upper bound to its sample complexity which confirms this adaptivity and that matches the lower bound when $δ$ $\rightarrow$ 0. Finally, we evaluate our algorithm on both synthetic and real-world data, showing competitive performance with respect to existing baselines. △ Less

Submitted 2 November, 2021; originally announced November 2021.

Comments: Virtual conference

arXiv:2110.09133 [pdf, other]

Online Sign Identification: Minimization of the Number of Errors in Thresholding Bandits

Authors: Reda Ouhamma, Rémy Degenne, Pierre Gaillard, Vianney Perchet

Abstract: In the fixed budget thresholding bandit problem, an algorithm sequentially allocates a budgeted number of samples to different distributions. It then predicts whether the mean of each distribution is larger or lower than a given threshold. We introduce a large family of algorithms (containing most existing relevant ones), inspired by the Frank-Wolfe algorithm, and provide a thorough yet generic an… ▽ More In the fixed budget thresholding bandit problem, an algorithm sequentially allocates a budgeted number of samples to different distributions. It then predicts whether the mean of each distribution is larger or lower than a given threshold. We introduce a large family of algorithms (containing most existing relevant ones), inspired by the Frank-Wolfe algorithm, and provide a thorough yet generic analysis of their performance. This allowed us to construct new explicit algorithms, for a broad class of problems, whose losses are within a small constant factor of the non-adaptive oracle ones. Quite interestingly, we observed that adaptive methods empirically greatly out-perform non-adaptive oracles, an uncommon behavior in standard online learning settings, such as regret minimization. We explain this surprising phenomenon on an insightful toy problem. △ Less

Submitted 18 October, 2021; originally announced October 2021.

Comments: 10+15 pages. To be published in the proceedings of NeurIPS 2021

arXiv:2007.00969 [pdf, other]

Structure Adaptive Algorithms for Stochastic Bandits

Authors: Rémy Degenne, Han Shao, Wouter M. Koolen

Abstract: We study reward maximisation in a wide class of structured stochastic multi-armed bandit problems, where the mean rewards of arms satisfy some given structural constraints, e.g. linear, unimodal, sparse, etc. Our aim is to develop methods that are flexible (in that they easily adapt to different structures), powerful (in that they perform well empirically and/or provably match instance-dependent l… ▽ More We study reward maximisation in a wide class of structured stochastic multi-armed bandit problems, where the mean rewards of arms satisfy some given structural constraints, e.g. linear, unimodal, sparse, etc. Our aim is to develop methods that are flexible (in that they easily adapt to different structures), powerful (in that they perform well empirically and/or provably match instance-dependent lower bounds) and efficient in that the per-round computational burden is small. We develop asymptotically optimal algorithms from instance-dependent lower-bounds using iterative saddle-point solvers. Our approach generalises recent iterative methods for pure exploration to reward maximisation, where a major challenge arises from the estimation of the sub-optimality gaps and their reciprocals. Still we manage to achieve all the above desiderata. Notably, our technique avoids the computational cost of the full-blown saddle point oracle employed by previous work, while at the same time enabling finite-time regret bounds. Our experiments reveal that our method successfully leverages the structural assumptions, while its regret is at worst comparable to that of vanilla UCB. △ Less

Submitted 2 July, 2020; originally announced July 2020.

Comments: 10+18 pages. To be published in the proceedings of ICML 2020

arXiv:2007.00953 [pdf, other]

Gamification of Pure Exploration for Linear Bandits

Authors: Rémy Degenne, Pierre Ménard, Xuedong Shang, Michal Valko

Abstract: We investigate an active pure-exploration setting, that includes best-arm identification, in the context of linear stochastic bandits. While asymptotically optimal algorithms exist for standard multi-arm bandits, the existence of such algorithms for the best-arm identification in linear bandits has been elusive despite several attempts to address it. First, we provide a thorough comparison and new… ▽ More We investigate an active pure-exploration setting, that includes best-arm identification, in the context of linear stochastic bandits. While asymptotically optimal algorithms exist for standard multi-arm bandits, the existence of such algorithms for the best-arm identification in linear bandits has been elusive despite several attempts to address it. First, we provide a thorough comparison and new insight over different notions of optimality in the linear case, including G-optimality, transductive optimality from optimal experimental design and asymptotic optimality. Second, we design the first asymptotically optimal algorithm for fixed-confidence pure exploration in linear bandits. As a consequence, our algorithm naturally bypasses the pitfall caused by a simple but difficult instance, that most prior algorithms had to be engineered to deal with explicitly. Finally, we avoid the need to fully solve an optimal design problem by providing an approach that entails an efficient implementation. △ Less

Submitted 2 July, 2020; originally announced July 2020.

Comments: 11+25 pages. To be published in the proceedings of ICML 2020

arXiv:1906.10431 [pdf, other]

Non-Asymptotic Pure Exploration by Solving Games

Authors: Rémy Degenne, Wouter M. Koolen, Pierre Ménard

Abstract: Pure exploration (aka active testing) is the fundamental task of sequentially gathering information to answer a query about a stochastic environment. Good algorithms make few mistakes and take few samples. Lower bounds (for multi-armed bandit models with arms in an exponential family) reveal that the sample complexity is determined by the solution to an optimisation problem. The existing state of… ▽ More Pure exploration (aka active testing) is the fundamental task of sequentially gathering information to answer a query about a stochastic environment. Good algorithms make few mistakes and take few samples. Lower bounds (for multi-armed bandit models with arms in an exponential family) reveal that the sample complexity is determined by the solution to an optimisation problem. The existing state of the art algorithms achieve asymptotic optimality by solving a plug-in estimate of that optimisation problem at each step. We interpret the optimisation problem as an unknown game, and propose sampling rules based on iterative strategies to estimate and converge to its saddle point. We apply no-regret learners to obtain the first finite confidence guarantees that are adapted to the exponential family and which apply to any pure exploration query and bandit structure. Moreover, our algorithms only use a best response oracle instead of fully solving the optimisation problem. △ Less

Submitted 25 June, 2019; originally announced June 2019.

arXiv:1902.03475 [pdf, other]

Pure Exploration with Multiple Correct Answers

Authors: Rémy Degenne, Wouter M. Koolen

Abstract: We determine the sample complexity of pure exploration bandit problems with multiple good answers. We derive a lower bound using a new game equilibrium argument. We show how continuity and convexity properties of single-answer problems ensures that the Track-and-Stop algorithm has asymptotically optimal sample complexity. However, that convexity is lost when going to the multiple-answer setting. W… ▽ More We determine the sample complexity of pure exploration bandit problems with multiple good answers. We derive a lower bound using a new game equilibrium argument. We show how continuity and convexity properties of single-answer problems ensures that the Track-and-Stop algorithm has asymptotically optimal sample complexity. However, that convexity is lost when going to the multiple-answer setting. We present a new algorithm which extends Track-and-Stop to the multiple-answer case and has asymptotic sample complexity matching the lower bound. △ Less

Submitted 9 February, 2019; originally announced February 2019.

arXiv:1810.04088 [pdf, other]

Bridging the gap between regret minimization and best arm identification, with application to A/B tests

Authors: Rémy Degenne, Thomas Nedelec, Clément Calauzènes, Vianney Perchet

Abstract: State of the art online learning procedures focus either on selecting the best alternative ("best arm identification") or on minimizing the cost (the "regret"). We merge these two objectives by providing the theoretical analysis of cost minimizing algorithms that are also delta-PAC (with a proven guaranteed bound on the decision time), hence fulfilling at the same time regret minimization and best… ▽ More State of the art online learning procedures focus either on selecting the best alternative ("best arm identification") or on minimizing the cost (the "regret"). We merge these two objectives by providing the theoretical analysis of cost minimizing algorithms that are also delta-PAC (with a proven guaranteed bound on the decision time), hence fulfilling at the same time regret minimization and best arm identification. This analysis sheds light on the common observation that ill-callibrated UCB-algorithms minimize regret while still identifying quickly the best arm. We also extend these results to the non-iid case faced by many practitioners. This provides a technique to make cost versus decision time compromise when doing adaptive tests with applications ranging from website A/B testing to clinical trials. △ Less

Submitted 26 February, 2019; v1 submitted 9 October, 2018; originally announced October 2018.

Journal ref: AISTATS 2019 proceedings

arXiv:1807.03558 [pdf, other]

Bandits with Side Observations: Bounded vs. Logarithmic Regret

Authors: Rémy Degenne, Evrard Garcelon, Vianney Perchet

Abstract: We consider the classical stochastic multi-armed bandit but where, from time to time and roughly with frequency $ε$, an extra observation is gathered by the agent for free. We prove that, no matter how small $ε$ is the agent can ensure a regret uniformly bounded in time. More precisely, we construct an algorithm with a regret smaller than $\sum_i \frac{\log(1/ε)}{Δ_i}$, up to multiplicative cons… ▽ More We consider the classical stochastic multi-armed bandit but where, from time to time and roughly with frequency $ε$, an extra observation is gathered by the agent for free. We prove that, no matter how small $ε$ is the agent can ensure a regret uniformly bounded in time. More precisely, we construct an algorithm with a regret smaller than $\sum_i \frac{\log(1/ε)}{Δ_i}$, up to multiplicative constant and loglog terms. We also prove a matching lower-bound, stating that no reasonable algorithm can outperform this quantity. △ Less

Submitted 10 July, 2018; originally announced July 2018.

Comments: Conference on Uncertainty in Artificial Intelligence (UAI) 2018, 21 pages

arXiv:1612.01859 [pdf, other]

Combinatorial semi-bandit with known covariance

Authors: Rémy Degenne, Vianney Perchet

Abstract: The combinatorial stochastic semi-bandit problem is an extension of the classical multi-armed bandit problem in which an algorithm pulls more than one arm at each stage and the rewards of all pulled arms are revealed. One difference with the single arm variant is that the dependency structure of the arms is crucial. Previous works on this setting either used a worst-case approach or imposed indepe… ▽ More The combinatorial stochastic semi-bandit problem is an extension of the classical multi-armed bandit problem in which an algorithm pulls more than one arm at each stage and the rewards of all pulled arms are revealed. One difference with the single arm variant is that the dependency structure of the arms is crucial. Previous works on this setting either used a worst-case approach or imposed independence of the arms. We introduce a way to quantify the dependency structure of the problem and design an algorithm that adapts to it. The algorithm is based on linear regression and the analysis develops techniques from the linear bandit literature. By comparing its performance to a new lower bound, we prove that it is optimal, up to a poly-logarithmic factor in the number of pulled arms. △ Less

Submitted 6 December, 2016; originally announced December 2016.

Comments: in NIPS 2016 (Conference on Neural Information Processing Systems), Dec 2016, Barcelona, Spain

Showing 1–20 of 20 results for author: Degenne, R