Showing 1–33 of 33 results for author: Menard, P

Searching in archive cs.
  1. arXiv:2310.18186  [pdf, other]

    stat.ML cs.LG

    Model-free Posterior Sampling via Learning Rate Randomization

    Authors: Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines, Remi Munos, Alexey Naumov, Pierre Perrault, Michal Valko, Pierre Menard

    Abstract: In this paper, we introduce Randomized Q-learning (RandQL), a novel randomized model-free algorithm for regret minimization in episodic Markov Decision Processes (MDPs). To the best of our knowledge, RandQL is the first tractable model-free posterior sampling-based algorithm. We analyze the performance of RandQL in both tabular and non-tabular metric space settings. In tabular MDPs, RandQL achieve…

    Submitted 27 October, 2023; originally announced October 2023.

    Comments: NeurIPS-2023

  2. arXiv:2310.17303  [pdf, ps, other]

    stat.ML cs.LG

    Demonstration-Regularized RL

    Authors: Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines, Alexey Naumov, Pierre Perrault, Michal Valko, Pierre Menard

    Abstract: Incorporating expert demonstrations has empirically helped to improve the sample efficiency of reinforcement learning (RL). This paper quantifies theoretically to what extent this extra information reduces RL's sample complexity. In particular, we study the demonstration-regularized reinforcement learning that leverages the expert demonstrations by KL-regularization for a policy learned by behavio…

    Submitted 10 June, 2024; v1 submitted 26 October, 2023; originally announced October 2023.

    Comments: This revision fixes an error due to the use of some incorrect results (Lemma 32, Corollary 11 by Talebi & Maillard, 2018) in the proof of Theorem 8. The condition for the RLHF results has slightly changed.

  3. arXiv:2309.00656  [pdf, other]

    cs.GT cs.LG stat.ML

    Local and adaptive mirror descents in extensive-form games

    Authors: Côme Fiegel, Pierre Ménard, Tadashi Kozuno, Rémi Munos, Vianney Perchet, Michal Valko

    Abstract: We study how to learn $ε$-optimal strategies in zero-sum imperfect information games (IIG) with trajectory feedback. In this setting, players update their policies sequentially based on their observations over a fixed number of episodes, denoted by $T$. Existing procedures suffer from high variance due to the use of importance sampling over sequences of actions (Steinberger et al., 2020; McAleer e…

    Submitted 1 September, 2023; originally announced September 2023.

  4. arXiv:2305.13185  [pdf, other]

    cs.LG

    Regularization and Variance-Weighted Regression Achieves Minimax Optimality in Linear MDPs: Theory and Practice

    Authors: Toshinori Kitamura, Tadashi Kozuno, Yunhao Tang, Nino Vieillard, Michal Valko, Wenhao Yang, Jincheng Mei, Pierre Ménard, Mohammad Gheshlaghi Azar, Rémi Munos, Olivier Pietquin, Matthieu Geist, Csaba Szepesvári, Wataru Kumagai, Yutaka Matsuo

    Abstract: Mirror descent value iteration (MDVI), an abstraction of Kullback-Leibler (KL) and entropy-regularized reinforcement learning (RL), has served as the basis for recent high-performing practical RL algorithms. However, despite the use of function approximation in practice, the theoretical understanding of MDVI has been limited to tabular Markov decision processes (MDPs). We study MDVI with linear fu…

    Submitted 22 May, 2023; originally announced May 2023.

    Comments: ICML 2023 accepted

  5. arXiv:2303.14811  [pdf, other]

    cs.LG stat.ML

    Learning Generative Models with Goal-conditioned Reinforcement Learning

    Authors: Mariana Vargas Vieyra, Pierre Ménard

    Abstract: We present a novel, alternative framework for learning generative models with goal-conditioned reinforcement learning. We define two agents, a goal-conditioned agent (GC-agent) and a supervised agent (S-agent). Given a user-input initial state, the GC-agent learns to reconstruct the training set. In this context, elements in the training set are the goals. During training, the S-agent learns to im…

    Submitted 26 March, 2023; originally announced March 2023.

  6. arXiv:2303.08059  [pdf, other]

    stat.ML cs.LG

    Fast Rates for Maximum Entropy Exploration

    Authors: Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines, Remi Munos, Alexey Naumov, Pierre Perrault, Yunhao Tang, Michal Valko, Pierre Menard

    Abstract: We address the challenge of exploration in reinforcement learning (RL) when the agent operates in an unknown environment with sparse or no rewards. In this work, we study the maximum entropy exploration problem of two different types. The first type is visitation entropy maximization, previously considered by Hazan et al. (2019) in the discounted setting. For this type of exploration, we propose a g…

    Submitted 6 June, 2023; v1 submitted 14 March, 2023; originally announced March 2023.

    Comments: ICML-2023

  7. arXiv:2212.12567  [pdf, other]

    stat.ML cs.LG

    Adapting to game trees in zero-sum imperfect information games

    Authors: Côme Fiegel, Pierre Ménard, Tadashi Kozuno, Rémi Munos, Vianney Perchet, Michal Valko

    Abstract: Imperfect information games (IIG) are games in which each player only partially observes the current game state. We study how to learn $ε$-optimal strategies in a zero-sum IIG through self-play with trajectory feedback. We give a problem-independent lower bound $\widetilde{\mathcal{O}}(H(A_{\mathcal{X}}+B_{\mathcal{Y}})/ε^2)$ on the required number of realizations to learn these strategies with hi…

    Submitted 15 February, 2023; v1 submitted 23 December, 2022; originally announced December 2022.

  8. arXiv:2209.14414  [pdf, other]

    stat.ML cs.LG

    Optimistic Posterior Sampling for Reinforcement Learning with Few Samples and Tight Guarantees

    Authors: Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines, Remi Munos, Alexey Naumov, Mark Rowland, Michal Valko, Pierre Menard

    Abstract: We consider reinforcement learning in an environment modeled by an episodic, finite, stage-dependent Markov decision process of horizon $H$ with $S$ states, and $A$ actions. The performance of an agent is measured by the regret after interacting with the environment for $T$ episodes. We propose an optimistic posterior sampling algorithm for reinforcement learning (OPSRL), a simple variant of poste…

    Submitted 28 September, 2022; originally announced September 2022.

    Comments: arXiv admin note: text overlap with arXiv:2205.07704

  9. arXiv:2205.14211  [pdf, other]

    cs.LG cs.AI stat.ML

    KL-Entropy-Regularized RL with a Generative Model is Minimax Optimal

    Authors: Tadashi Kozuno, Wenhao Yang, Nino Vieillard, Toshinori Kitamura, Yunhao Tang, Jincheng Mei, Pierre Ménard, Mohammad Gheshlaghi Azar, Michal Valko, Rémi Munos, Olivier Pietquin, Matthieu Geist, Csaba Szepesvári

    Abstract: In this work, we consider and analyze the sample complexity of model-free reinforcement learning with a generative model. Particularly, we analyze mirror descent value iteration (MDVI) by Geist et al. (2019) and Vieillard et al. (2020a), which uses the Kullback-Leibler divergence and entropy regularization in its value and policy updates. Our analysis shows that it is nearly minimax-optimal for fi…

    Submitted 27 May, 2022; originally announced May 2022.

    Comments: 29 pages, 6 figures

  10. arXiv:2205.07704  [pdf, other]

    stat.ML cs.LG

    From Dirichlet to Rubin: Optimistic Exploration in RL without Bonuses

    Authors: Daniil Tiapkin, Denis Belomestny, Eric Moulines, Alexey Naumov, Sergey Samsonov, Yunhao Tang, Michal Valko, Pierre Menard

    Abstract: We propose the Bayes-UCBVI algorithm for reinforcement learning in tabular, stage-dependent, episodic Markov decision processes: a natural extension of the Bayes-UCB algorithm by Kaufmann et al. (2012) for multi-armed bandits. Our method uses the quantile of a Q-value function posterior as an upper confidence bound on the optimal Q-value function. For Bayes-UCBVI, we prove a regret bound of order…

    Submitted 22 June, 2022; v1 submitted 16 May, 2022; originally announced May 2022.

  11. arXiv:2112.01452  [pdf, other]

    cs.AI cs.LG math.OC stat.ML

    Indexed Minimum Empirical Divergence for Unimodal Bandits

    Authors: Hassan Saber, Pierre Ménard, Odalric-Ambrym Maillard

    Abstract: We consider a multi-armed bandit problem specified by a set of one-dimensional exponential family distributions endowed with a unimodal structure. We introduce IMED-UB, an algorithm that optimally exploits the unimodal structure, by adapting to this setting the Indexed Minimum Empirical Divergence (IMED) algorithm introduced by Honda and Takemura [2015]. Owing to our proof technique, we are able to…

    Submitted 2 December, 2021; originally announced December 2021.

    Comments: NeurIPS 2021 - International Conference on Neural Information Processing Systems, Dec 2021, Virtual-only Conference, United States. arXiv admin note: substantial text overlap with arXiv:2006.16569, arXiv:2007.03224

  12. arXiv:2111.12045  [pdf, other]

    cs.LG

    Adaptive Multi-Goal Exploration

    Authors: Jean Tarbouriech, Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Michal Valko, Alessandro Lazaric

    Abstract: We introduce a generic strategy for provably efficient multi-goal exploration. It relies on AdaGoal, a novel goal selection scheme that leverages a measure of uncertainty in reaching states to adaptively target goals that are neither too difficult nor too easy. We show how AdaGoal can be used to tackle the objective of learning an $ε$-optimal goal-conditioned policy for the (initially unknown) set…

    Submitted 24 February, 2022; v1 submitted 23 November, 2021; originally announced November 2021.

    Comments: AISTATS 2022

  13. arXiv:2106.10166  [pdf, other]

    stat.ML cs.LG

    Problem Dependent View on Structured Thresholding Bandit Problems

    Authors: James Cheshire, Pierre Ménard, Alexandra Carpentier

    Abstract: We investigate the problem-dependent regime in the stochastic Thresholding Bandit problem (TBP) under several shape constraints. In the TBP, the objective of the learner is to output, at the end of a sequential game, the set of arms whose means are above a given threshold. The vanilla, unstructured case is already well studied in the literature. Taking $K$ as the number of arms, we consider the c…

    Submitted 18 June, 2021; originally announced June 2021.

    Comments: 25 pages. arXiv admin note: text overlap with arXiv:2006.10006

  14. arXiv:2106.06279  [pdf, ps, other]

    stat.ML cs.LG

    Model-Free Learning for Two-Player Zero-Sum Partially Observable Markov Games with Perfect Recall

    Authors: Tadashi Kozuno, Pierre Ménard, Rémi Munos, Michal Valko

    Abstract: We study the problem of learning a Nash equilibrium (NE) in an imperfect information game (IIG) through self-play. Precisely, we focus on two-player, zero-sum, episodic, tabular IIG under the perfect-recall assumption where the only feedback is realizations of the game (bandit feedback). In particular, the dynamics of the IIG are not known -- we can only access them by sampling or interacting with a g…

    Submitted 11 June, 2021; originally announced June 2021.

    Comments: 20 pages

  15. arXiv:2103.12452  [pdf, other]

    cs.LG stat.ML

    Bandits with many optimal arms

    Authors: Rianne de Heide, James Cheshire, Pierre Ménard, Alexandra Carpentier

    Abstract: We consider a stochastic bandit problem with a possibly infinite number of arms. We write $p^*$ for the proportion of optimal arms and $Δ$ for the minimal mean-gap between optimal and sub-optimal arms. We characterize the optimal learning rates both in the cumulative regret setting, and in the best-arm identification setting in terms of the problem parameters $T$ (the budget), $p^*$ and $Δ$. For t…

    Submitted 5 November, 2021; v1 submitted 23 March, 2021; originally announced March 2021.

    Comments: Substantial rewrite and added experiments. Accepted for NeurIPS 2021

  16. arXiv:2103.01312  [pdf, other]

    stat.ML cs.LG

    UCB Momentum Q-learning: Correcting the bias without forgetting

    Authors: Pierre Menard, Omar Darwiche Domingues, Xuedong Shang, Michal Valko

    Abstract: We propose UCBMQ, Upper Confidence Bound Momentum Q-learning, a new algorithm for reinforcement learning in tabular and possibly stage-dependent, episodic Markov decision processes. UCBMQ is based on Q-learning where we add a momentum term and rely on the principle of optimism in the face of uncertainty to deal with exploration. Our new technical ingredient of UCBMQ is the use of momentum to correct the…

    Submitted 18 March, 2022; v1 submitted 1 March, 2021; originally announced March 2021.

  17. arXiv:2010.03531  [pdf, ps, other]

    cs.LG stat.ML

    Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited

    Authors: Omar Darwiche Domingues, Pierre Ménard, Emilie Kaufmann, Michal Valko

    Abstract: In this paper, we propose new problem-independent lower bounds on the sample complexity and regret in episodic MDPs, with a particular focus on the non-stationary case in which the transition kernel is allowed to change in each stage of the episode. Our main contribution is a novel lower bound of $Ω((H^3SA/ε^2)\log(1/δ))$ on the sample complexity of an $(ε,δ)$-PAC algorithm for best poli…

    Submitted 7 October, 2020; originally announced October 2020.

  18. arXiv:2007.13442  [pdf, other]

    cs.LG stat.ML

    Fast active learning for pure exploration in reinforcement learning

    Authors: Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Emilie Kaufmann, Edouard Leurent, Michal Valko

    Abstract: Realistic environments often provide agents with very limited feedback. When the environment is initially unknown, the feedback, in the beginning, can be completely absent, and the agents may first choose to devote all their effort to exploring efficiently. The exploration remains a challenge while it has been addressed with many hand-tuned heuristics with different levels of generality on one sid…

    Submitted 10 October, 2020; v1 submitted 27 July, 2020; originally announced July 2020.

  19. arXiv:2007.05078  [pdf, other]

    cs.LG stat.ML

    A Kernel-Based Approach to Non-Stationary Reinforcement Learning in Metric Spaces

    Authors: Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Emilie Kaufmann, Michal Valko

    Abstract: In this work, we propose KeRNS: an algorithm for episodic reinforcement learning in non-stationary Markov Decision Processes (MDPs) whose state-action set is endowed with a metric. Using a non-parametric model of the MDP built with time-dependent kernels, we prove a regret bound that scales with the covering dimension of the state-action space and the total variation of the MDP with time, which qu…

    Submitted 23 March, 2022; v1 submitted 9 July, 2020; originally announced July 2020.

    Comments: Update following the publication in AISTATS 2021. Fixed typos and lemma about runtime

  20. arXiv:2007.03224  [pdf, other]

    cs.IT stat.ML

    Optimal Strategies for Graph-Structured Bandits

    Authors: Hassan Saber, Pierre Ménard, Odalric-Ambrym Maillard

    Abstract: We study a structured variant of the multi-armed bandit problem specified by a set of Bernoulli distributions $ν = (ν_{a,b})_{a \in \mathcal{A}, b \in \mathcal{B}}$ with means $(μ_{a,b})_{a \in \mathcal{A}, b \in \mathcal{B}} \in [0,1]^{\mathcal{A}\times\mathcal{B}}$ and by a given weight matrix $ω = (ω_{b,b'})_{b,b' \in \mathcal{B}}$, where $\mathcal{A}$ is a finite set of arms…

    Submitted 10 July, 2020; v1 submitted 7 July, 2020; originally announced July 2020.

  21. arXiv:2007.00953  [pdf, other]

    stat.ML cs.LG

    Gamification of Pure Exploration for Linear Bandits

    Authors: Rémy Degenne, Pierre Ménard, Xuedong Shang, Michal Valko

    Abstract: We investigate an active pure-exploration setting, which includes best-arm identification, in the context of linear stochastic bandits. While asymptotically optimal algorithms exist for standard multi-armed bandits, the existence of such algorithms for best-arm identification in linear bandits has been elusive despite several attempts to address it. First, we provide a thorough comparison and new…

    Submitted 2 July, 2020; originally announced July 2020.

    Comments: 11+25 pages. To be published in the proceedings of ICML 2020

  22. arXiv:2006.16569  [pdf, other]

    cs.LG cs.IT stat.ML

    Forced-exploration free Strategies for Unimodal Bandits

    Authors: Hassan Saber, Pierre Ménard, Odalric-Ambrym Maillard

    Abstract: We consider a multi-armed bandit problem specified by a set of Gaussian or Bernoulli distributions endowed with a unimodal structure. Although this problem has been addressed in the literature (Combes and Proutiere, 2014), the state-of-the-art algorithms for such a structure involve a forced-exploration mechanism. We introduce IMED-UB, the first forced-exploration free strategy that exploits the…

    Submitted 30 June, 2020; originally announced June 2020.

  23. arXiv:2006.10006  [pdf, ps, other]

    cs.LG stat.ML

    The Influence of Shape Constraints on the Thresholding Bandit Problem

    Authors: James Cheshire, Pierre Menard, Alexandra Carpentier

    Abstract: We investigate the stochastic Thresholding Bandit problem (TBP) under several shape constraints. On top of (i) the vanilla, unstructured TBP, we consider (ii) the case where the sequence of arms' means $(μ_k)_k$ is monotonically increasing (MTBP), (iii) the case where $(μ_k)_k$ is unimodal (UTBP) and (iv) the case where $(μ_k)_k$ is concave (CTBP). In the TBP problem the aim is to output, at the end of…

    Submitted 23 February, 2021; v1 submitted 17 June, 2020; originally announced June 2020.

  24. arXiv:2006.06294  [pdf, other]

    cs.LG stat.ML

    Adaptive Reward-Free Exploration

    Authors: Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Edouard Leurent, Michal Valko

    Abstract: Reward-free exploration is a reinforcement learning setting studied by Jin et al. (2020), who address it by running several algorithms with regret guarantees in parallel. In our work, we instead give a more natural adaptive approach for reward-free exploration which directly reduces upper bounds on the maximum MDP estimation error. We show that, interestingly, our reward-free UCRL algorithm can be…

    Submitted 7 October, 2020; v1 submitted 11 June, 2020; originally announced June 2020.

  25. arXiv:2006.05879  [pdf, other]

    cs.LG stat.ML

    Planning in Markov Decision Processes with Gap-Dependent Sample Complexity

    Authors: Anders Jonsson, Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Edouard Leurent, Michal Valko

    Abstract: We propose MDP-GapE, a new trajectory-based Monte-Carlo Tree Search algorithm for planning in a Markov Decision Process in which transitions have a finite support. We prove an upper bound on the number of calls to the generative models needed for MDP-GapE to identify a near-optimal action with high probability. This problem-dependent sample complexity result is expressed in terms of the sub-optima…

    Submitted 10 June, 2020; originally announced June 2020.

  26. arXiv:2004.05599  [pdf, other]

    cs.LG stat.ML

    Kernel-Based Reinforcement Learning: A Finite-Time Analysis

    Authors: Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Emilie Kaufmann, Michal Valko

    Abstract: We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning problems whose state-action space is endowed with a metric. We introduce Kernel-UCBVI, a model-based optimistic algorithm that leverages the smoothness of the MDP and a non-parametric kernel estimator of the rewards and transitions to efficiently balance exploration and exploitation. For problems with $K$ epi…

    Submitted 23 March, 2022; v1 submitted 12 April, 2020; originally announced April 2020.

    Comments: Update following the publication in ICML 2021, including fixed typos

  27. arXiv:1910.10945  [pdf, other]

    cs.LG stat.ML

    Fixed-Confidence Guarantees for Bayesian Best-Arm Identification

    Authors: Xuedong Shang, Rianne de Heide, Emilie Kaufmann, Pierre Ménard, Michal Valko

    Abstract: We investigate and provide new insights on the sampling rule called Top-Two Thompson Sampling (TTTS). In particular, we justify its use for fixed-confidence best-arm identification. We further propose a variant of TTTS called Top-Two Transportation Cost (T3C), which disposes of the computational burden of TTTS. As our main contribution, we provide the first sample complexity analysis of TTTS and T…

    Submitted 28 October, 2019; v1 submitted 24 October, 2019; originally announced October 2019.

  28. arXiv:1906.10431  [pdf, other]

    stat.ML cs.LG

    Non-Asymptotic Pure Exploration by Solving Games

    Authors: Rémy Degenne, Wouter M. Koolen, Pierre Ménard

    Abstract: Pure exploration (aka active testing) is the fundamental task of sequentially gathering information to answer a query about a stochastic environment. Good algorithms make few mistakes and take few samples. Lower bounds (for multi-armed bandit models with arms in an exponential family) reveal that the sample complexity is determined by the solution to an optimisation problem. The existing state of…

    Submitted 25 June, 2019; originally announced June 2019.

  29. arXiv:1905.08165  [pdf, other]

    stat.ML cs.LG

    Gradient Ascent for Active Exploration in Bandit Problems

    Authors: Pierre Ménard

    Abstract: We present a new algorithm based on gradient ascent for a general Active Exploration bandit problem in the fixed-confidence setting. This problem encompasses several well-studied problems such as Best Arm Identification or Thresholding Bandits. It consists of a new sampling rule based on an online lazy mirror ascent. We prove that this algorithm is asymptotically optimal and, most importa…

    Submitted 20 May, 2019; originally announced May 2019.

    Comments: 21 pages, 1 figure

  30. arXiv:1805.05071  [pdf, other]

    stat.ML cs.LG math.ST

    KL-UCB-switch: optimal regret bounds for stochastic bandits from both a distribution-dependent and a distribution-free viewpoints

    Authors: Aurélien Garivier, Hédi Hadiji, Pierre Menard, Gilles Stoltz

    Abstract: We consider $K$-armed stochastic bandits and consider cumulative regret bounds up to time $T$. We are interested in strategies achieving simultaneously a distribution-free regret bound of optimal order $\sqrt{KT}$ and a distribution-dependent regret that is asymptotically optimal, that is, matching the $κ\ln T$ lower bound by Lai and Robbins (1985) and Burnetas and Katehakis (1996), where $κ$ is t…

    Submitted 1 July, 2022; v1 submitted 14 May, 2018; originally announced May 2018.

  31. arXiv:1702.07211  [pdf, ps, other]

    stat.ML cs.LG math.ST

    A minimax and asymptotically optimal algorithm for stochastic bandits

    Authors: Pierre Ménard, Aurélien Garivier

    Abstract: We propose the kl-UCB++ algorithm for regret minimization in stochastic bandit models with exponential families of distributions. We prove that it is simultaneously asymptotically optimal (in the sense of Lai and Robbins' lower bound) and minimax optimal. This is the first algorithm proved to enjoy these two properties at the same time. This work thus merges two different lines of research with s…

    Submitted 20 September, 2017; v1 submitted 23 February, 2017; originally announced February 2017.

    Journal ref: Algorithmic Learning Theory, Springer, 2017, 2017 Algorithmic Learning Theory Conference 76

  32. arXiv:1702.05985  [pdf, other]

    math.ST cs.IT

    Fano's inequality for random variables

    Authors: Sebastien Gerchinovitz, Pierre Ménard, Gilles Stoltz

    Abstract: We extend Fano's inequality, which controls the average probability of events in terms of the average of some $f$-divergences, to work with arbitrary events (not necessarily forming a partition) and even with arbitrary $[0,1]$-valued random variables, possibly in continuously infinite number. We provide two applications of these extensions, in which the consideration of random variables is parti…

    Submitted 10 June, 2019; v1 submitted 20 February, 2017; originally announced February 2017.

  33. arXiv:1602.07182  [pdf, other]

    math.ST cs.LG

    Explore First, Exploit Next: The True Shape of Regret in Bandit Problems

    Authors: Aurélien Garivier, Pierre Ménard, Gilles Stoltz

    Abstract: We revisit lower bounds on the regret in the case of multi-armed bandit problems. We obtain non-asymptotic, distribution-dependent bounds and provide straightforward proofs based only on well-known properties of Kullback-Leibler divergences. These bounds show in particular that in an initial phase the regret grows almost linearly, and that the well-known logarithmic growth of the regret only holds…

    Submitted 13 October, 2018; v1 submitted 23 February, 2016; originally announced February 2016.