
Showing 1–42 of 42 results for author: Pirotta, M

  1. arXiv:2403.13097  [pdf, other]

    cs.LG cs.AI

    Simple Ingredients for Offline Reinforcement Learning

    Authors: Edoardo Cetin, Andrea Tirinzoni, Matteo Pirotta, Alessandro Lazaric, Yann Ollivier, Ahmed Touati

    Abstract: Offline reinforcement learning algorithms have proven effective on datasets highly connected to the target downstream task. Yet, leveraging a novel testbed (MOOD) in which trajectories come from heterogeneous sources, we show that existing methods struggle with diverse data: their performance considerably deteriorates as data collected for related but different tasks is simply added to the offline…

    Submitted 19 March, 2024; originally announced March 2024.

  2. arXiv:2302.03789  [pdf, ps, other]

    cs.LG

    Layered State Discovery for Incremental Autonomous Exploration

    Authors: Liyu Chen, Andrea Tirinzoni, Alessandro Lazaric, Matteo Pirotta

    Abstract: We study the autonomous exploration (AX) problem proposed by Lim & Auer (2012). In this setting, the objective is to discover a set of $ε$-optimal policies reaching a set $\mathcal{S}_L^{\rightarrow}$ of incrementally $L$-controllable states. We introduce a novel layered decomposition of the set of incrementally $L$-controllable states that is based on the iterative application of a state-expansio…

    Submitted 7 February, 2023; originally announced February 2023.

  3. arXiv:2212.09429  [pdf, ps, other]

    cs.LG stat.ML

    On the Complexity of Representation Learning in Contextual Linear Bandits

    Authors: Andrea Tirinzoni, Matteo Pirotta, Alessandro Lazaric

    Abstract: In contextual linear bandits, the reward function is assumed to be a linear combination of an unknown reward vector and a given embedding of context-arm pairs. In practice, the embedding is often learned at the same time as the reward vector, thus leading to an online representation learning problem. Existing approaches to representation learning in contextual bandits are either very generic (e.g.…

    Submitted 19 December, 2022; originally announced December 2022.

  4. arXiv:2211.02233  [pdf, ps, other]

    cs.LG cs.AI

    Improved Adaptive Algorithm for Scalable Active Learning with Weak Labeler

    Authors: Yifang Chen, Karthik Sankararaman, Alessandro Lazaric, Matteo Pirotta, Dmytro Karamshuk, Qifan Wang, Karishma Mandyam, Sinong Wang, Han Fang

    Abstract: Active learning with strong and weak labelers considers a practical setting where we have access to both costly but accurate strong labelers and inaccurate but cheap predictions provided by weak labelers. We study this problem in the streaming setting, where decisions must be taken online. We design a novel algorithmic template, Weak Labeler Active Cover (WL-AC), that is able to robustly…

    Submitted 3 November, 2022; originally announced November 2022.

  5. arXiv:2210.13083  [pdf, other]

    cs.LG

    Scalable Representation Learning in Linear Contextual Bandits with Constant Regret Guarantees

    Authors: Andrea Tirinzoni, Matteo Papini, Ahmed Touati, Alessandro Lazaric, Matteo Pirotta

    Abstract: We study the problem of representation learning in stochastic contextual linear bandits. While the primary concern in this domain is usually to find realizable representations (i.e., those that allow predicting the reward function at any context-action pair exactly), it has been recently shown that representations with certain spectral properties (called HLS) may be more effective for the explorat…

    Submitted 24 October, 2022; originally announced October 2022.

    Comments: Accepted at NeurIPS 2022

  6. arXiv:2210.09957  [pdf, other]

    cs.LG cs.AI cs.CY cs.IR stat.ML

    Contextual bandits with concave rewards, and an application to fair ranking

    Authors: Virginie Do, Elvis Dohmatob, Matteo Pirotta, Alessandro Lazaric, Nicolas Usunier

    Abstract: We consider Contextual Bandits with Concave Rewards (CBCR), a multi-objective bandit problem where the desired trade-off between the rewards is defined by a known concave objective function, and the reward vector depends on an observed stochastic context. We present the first algorithm with provably vanishing regret for CBCR without restrictions on the policy space, whereas prior works were restri…

    Submitted 28 February, 2023; v1 submitted 18 October, 2022; originally announced October 2022.

    Comments: ICLR 2023

  7. arXiv:2210.04946  [pdf, ps, other]

    cs.LG cs.AI stat.ML

    Reaching Goals is Hard: Settling the Sample Complexity of the Stochastic Shortest Path

    Authors: Liyu Chen, Andrea Tirinzoni, Matteo Pirotta, Alessandro Lazaric

    Abstract: We study the sample complexity of learning an $ε$-optimal policy in the Stochastic Shortest Path (SSP) problem. We first derive sample complexity bounds when the learner has access to a generative model. We show that there exists a worst-case SSP instance with $S$ states, $A$ actions, minimum cost $c_{\min}$, and maximum expected cost of the optimal policy over all states $B_{\star}$, where any al…

    Submitted 10 October, 2022; originally announced October 2022.

  8. arXiv:2112.06517  [pdf, other]

    cs.LG stat.ML

    Top $K$ Ranking for Multi-Armed Bandit with Noisy Evaluations

    Authors: Evrard Garcelon, Vashist Avadhanula, Alessandro Lazaric, Matteo Pirotta

    Abstract: We consider a multi-armed bandit setting where, at the beginning of each round, the learner receives noisy, independent, and possibly biased evaluations of the true reward of each arm and selects $K$ arms with the objective of accumulating as much reward as possible over $T$ rounds. Under the assumption that at each round the true reward of each arm is drawn from a fixed distribution, we…

    Submitted 12 April, 2022; v1 submitted 13 December, 2021; originally announced December 2021.

  9. arXiv:2112.06008  [pdf, ps, other]

    cs.LG

    Privacy Amplification via Shuffling for Linear Contextual Bandits

    Authors: Evrard Garcelon, Kamalika Chaudhuri, Vianney Perchet, Matteo Pirotta

    Abstract: Contextual bandit algorithms are widely used in domains where it is desirable to provide a personalized service by leveraging contextual information that may contain sensitive data needing protection. Inspired by this scenario, we study the contextual linear bandit problem with differential privacy (DP) constraints. While the literature has focused on either centralized (joint DP)…

    Submitted 11 December, 2021; originally announced December 2021.

  10. arXiv:2112.01585  [pdf, ps, other]

    cs.LG

    Differentially Private Exploration in Reinforcement Learning with Linear Representation

    Authors: Paul Luyo, Evrard Garcelon, Alessandro Lazaric, Matteo Pirotta

    Abstract: This paper studies privacy-preserving exploration in Markov Decision Processes (MDPs) with linear representation. We first consider the setting of linear-mixture MDPs (Ayoub et al., 2020) (a.k.a. the model-based setting) and provide a unified framework for analyzing joint and local differentially private (DP) exploration. Through this framework, we prove a $\widetilde{O}(K^{3/4}/\sqrt{ε})$ regret bound…

    Submitted 6 December, 2021; v1 submitted 2 December, 2021; originally announced December 2021.

  11. arXiv:2111.12045  [pdf, other]

    cs.LG

    Adaptive Multi-Goal Exploration

    Authors: Jean Tarbouriech, Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Michal Valko, Alessandro Lazaric

    Abstract: We introduce a generic strategy for provably efficient multi-goal exploration. It relies on AdaGoal, a novel goal selection scheme that leverages a measure of uncertainty in reaching states to adaptively target goals that are neither too difficult nor too easy. We show how AdaGoal can be used to tackle the objective of learning an $ε$-optimal goal-conditioned policy for the (initially unknown) set…

    Submitted 24 February, 2022; v1 submitted 23 November, 2021; originally announced November 2021.

    Comments: AISTATS 2022

  12. arXiv:2110.14798  [pdf, other]

    cs.LG

    Reinforcement Learning in Linear MDPs: Constant Regret and Representation Selection

    Authors: Matteo Papini, Andrea Tirinzoni, Aldo Pacchiano, Marcello Restelli, Alessandro Lazaric, Matteo Pirotta

    Abstract: We study the role of the representation of state-action value functions in regret minimization in finite-horizon Markov Decision Processes (MDPs) with linear structure. We first derive a necessary condition on the representation, called universally spanning optimal features (UNISOFT), to achieve constant regret in any MDP with linear reward function. This result encompasses the well-known settings…

    Submitted 27 October, 2021; originally announced October 2021.

    Comments: Accepted at NeurIPS 2021

  13. arXiv:2106.13013  [pdf, ps, other]

    cs.LG

    A Fully Problem-Dependent Regret Lower Bound for Finite-Horizon MDPs

    Authors: Andrea Tirinzoni, Matteo Pirotta, Alessandro Lazaric

    Abstract: We derive a novel asymptotic problem-dependent lower-bound for regret minimization in finite-horizon tabular Markov Decision Processes (MDPs). While, similar to prior work (e.g., for ergodic MDPs), the lower-bound is the solution to an optimization problem, our derivation reveals the need for an additional constraint on the visitation distribution over state-action pairs that explicitly accounts f…

    Submitted 24 June, 2021; originally announced June 2021.

  14. arXiv:2106.11692  [pdf, ps, other]

    cs.LG stat.ML

    A Reduction-Based Framework for Conservative Bandits and Reinforcement Learning

    Authors: Yunchang Yang, Tianhao Wu, Han Zhong, Evrard Garcelon, Matteo Pirotta, Alessandro Lazaric, Liwei Wang, Simon S. Du

    Abstract: In this paper, we present a reduction-based framework for conservative bandits and RL, in which our core technique is to calculate the necessary and sufficient budget obtained from running the baseline policy. For lower bounds, we improve the existing lower bound for conservative multi-armed bandits and obtain new lower bounds for conservative linear bandits, tabular RL and low-rank MDP, through a…

    Submitted 16 March, 2022; v1 submitted 22 June, 2021; originally announced June 2021.

  15. arXiv:2104.11186  [pdf, other]

    cs.LG

    Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret

    Authors: Jean Tarbouriech, Runlong Zhou, Simon S. Du, Matteo Pirotta, Michal Valko, Alessandro Lazaric

    Abstract: We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state. We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus to induce an optimistic SSP problem whose associated value iteration schem…

    Submitted 10 December, 2021; v1 submitted 22 April, 2021; originally announced April 2021.

    Comments: NeurIPS 2021

  16. arXiv:2104.03781  [pdf, other]

    cs.LG

    Leveraging Good Representations in Linear Contextual Bandits

    Authors: Matteo Papini, Andrea Tirinzoni, Marcello Restelli, Alessandro Lazaric, Matteo Pirotta

    Abstract: The linear contextual bandit literature is mostly focused on the design of efficient learning algorithms for a given representation. However, a contextual bandit problem may admit multiple linear representations, each one with different characteristics that directly impact the regret of the learning algorithm. In particular, recent works showed that there exist "good" representations for which con…

    Submitted 8 April, 2021; originally announced April 2021.

  17. arXiv:2103.09927  [pdf, other]

    cs.LG

    Encrypted Linear Contextual Bandit

    Authors: Evrard Garcelon, Vianney Perchet, Matteo Pirotta

    Abstract: The contextual bandit is a general framework for online learning in sequential decision-making problems that has found application in a wide range of domains, including recommendation systems, online advertising, and clinical trials. A critical aspect of bandit methods is that they require observing the contexts (i.e., individual or group-level data) and rewards in order to solve the sequential p…

    Submitted 23 March, 2022; v1 submitted 17 March, 2021; originally announced March 2021.

  18. arXiv:2012.14755  [pdf, other]

    cs.LG stat.ML

    Improved Sample Complexity for Incremental Autonomous Exploration in MDPs

    Authors: Jean Tarbouriech, Matteo Pirotta, Michal Valko, Alessandro Lazaric

    Abstract: We investigate the exploration of an unknown environment when no reward function is provided. Building on the incremental exploration setting introduced by Lim and Auer [1], we define the objective of learning the set of $ε$-optimal goal-conditioned policies attaining all states that are incrementally reachable within $L$ steps (in expectation) from a reference state $s_0$. In this paper, we intro…

    Submitted 29 December, 2020; originally announced December 2020.

    Comments: NeurIPS 2020

  19. arXiv:2010.12247  [pdf, other]

    cs.LG stat.ML

    An Asymptotically Optimal Primal-Dual Incremental Algorithm for Contextual Linear Bandits

    Authors: Andrea Tirinzoni, Matteo Pirotta, Marcello Restelli, Alessandro Lazaric

    Abstract: In the contextual linear bandit setting, algorithms built on the optimism principle fail to exploit the structure of the problem and have been shown to be asymptotically suboptimal. In this paper, we follow recent approaches of deriving asymptotically optimal algorithms from problem-dependent regret lower bounds and we introduce a novel algorithm improving over the state-of-the-art along multiple…

    Submitted 20 November, 2020; v1 submitted 23 October, 2020; originally announced October 2020.

    Comments: To appear at NeurIPS 2020. V2: clarified dependencies in the worst-case regret bound

  20. arXiv:2010.07778  [pdf, other]

    cs.LG

    Local Differential Privacy for Regret Minimization in Reinforcement Learning

    Authors: Evrard Garcelon, Vianney Perchet, Ciara Pike-Burke, Matteo Pirotta

    Abstract: Reinforcement learning algorithms are widely used in domains where it is desirable to provide a personalized service. In these domains it is common that user data contains sensitive information that needs to be protected from third parties. Motivated by this, we study privacy in the context of finite-horizon Markov Decision Processes (MDPs) by requiring information to be obfuscated on the user sid…

    Submitted 27 October, 2021; v1 submitted 15 October, 2020; originally announced October 2020.

  21. arXiv:2007.06437  [pdf, other]

    cs.LG stat.ML

    A Provably Efficient Sample Collection Strategy for Reinforcement Learning

    Authors: Jean Tarbouriech, Matteo Pirotta, Michal Valko, Alessandro Lazaric

    Abstract: One of the challenges in online reinforcement learning (RL) is that the agent needs to trade off the exploration of the environment and the exploitation of the samples to optimize its behavior. Whether we optimize for regret, sample complexity, state-space coverage or model estimation, we need to strike a different exploration-exploitation trade-off. In this paper, we propose to tackle the explora…

    Submitted 18 November, 2021; v1 submitted 13 July, 2020; originally announced July 2020.

    Comments: NeurIPS 2021

  22. arXiv:2007.05456  [pdf, ps, other]

    cs.LG stat.ML

    Improved Analysis of UCRL2 with Empirical Bernstein Inequality

    Authors: Ronan Fruit, Matteo Pirotta, Alessandro Lazaric

    Abstract: We consider the problem of exploration-exploitation in communicating Markov Decision Processes. We provide an analysis of UCRL2 with Empirical Bernstein inequalities (UCRL2B). For any MDP with $S$ states, $A$ actions, $Γ\leq S$ next states and diameter $D$, the regret of UCRL2B is bounded as $\widetilde{O}(\sqrt{DΓS A T})$.

    Submitted 10 July, 2020; originally announced July 2020.

    Comments: Document in support of the tutorial at ALT 2019

  23. arXiv:2007.05078  [pdf, other]

    cs.LG stat.ML

    A Kernel-Based Approach to Non-Stationary Reinforcement Learning in Metric Spaces

    Authors: Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Emilie Kaufmann, Michal Valko

    Abstract: In this work, we propose KeRNS: an algorithm for episodic reinforcement learning in non-stationary Markov Decision Processes (MDPs) whose state-action set is endowed with a metric. Using a non-parametric model of the MDP built with time-dependent kernels, we prove a regret bound that scales with the covering dimension of the state-action space and the total variation of the MDP with time, which qu…

    Submitted 23 March, 2022; v1 submitted 9 July, 2020; originally announced July 2020.

    Comments: Update following the publication in AISTATS 2021. Fixed typos and lemma about runtime

  24. arXiv:2005.02934  [pdf, other]

    cs.LG cs.AI stat.ML

    Learning Adaptive Exploration Strategies in Dynamic Environments Through Informed Policy Regularization

    Authors: Pierre-Alexandre Kamienny, Matteo Pirotta, Alessandro Lazaric, Thibault Lavril, Nicolas Usunier, Ludovic Denoyer

    Abstract: We study the problem of learning exploration-exploitation strategies that effectively adapt to dynamic environments, where the task may change over time. While RNN-based policies could in principle represent such strategies, in practice their training time is prohibitive and the learning process often converges to poor solutions. In this paper, we consider the case where the agent has access to a…

    Submitted 6 May, 2020; originally announced May 2020.

    Comments: 18 pages

    MSC Class: 68T99

  25. arXiv:2004.05599  [pdf, other]

    cs.LG stat.ML

    Kernel-Based Reinforcement Learning: A Finite-Time Analysis

    Authors: Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Emilie Kaufmann, Michal Valko

    Abstract: We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning problems whose state-action space is endowed with a metric. We introduce Kernel-UCBVI, a model-based optimistic algorithm that leverages the smoothness of the MDP and a non-parametric kernel estimator of the rewards and transitions to efficiently balance exploration and exploitation. For problems with $K$ epi…

    Submitted 23 March, 2022; v1 submitted 12 April, 2020; originally announced April 2020.

    Comments: Update following the publication in ICML 2021, including fixed typos

  26. arXiv:2003.03297  [pdf, other]

    stat.ML cs.LG

    Active Model Estimation in Markov Decision Processes

    Authors: Jean Tarbouriech, Shubhanshu Shekhar, Matteo Pirotta, Mohammad Ghavamzadeh, Alessandro Lazaric

    Abstract: We study the problem of efficient exploration in order to learn an accurate model of an environment, modeled as a Markov decision process (MDP). Efficient exploration in this problem requires the agent to identify the regions in which estimating the model is more difficult and then exploit this knowledge to collect more samples there. In this paper, we formalize this problem, introduce the first a…

    Submitted 22 June, 2020; v1 submitted 6 March, 2020; originally announced March 2020.

  27. arXiv:2003.02189  [pdf, ps, other]

    cs.LG stat.ML

    Exploration-Exploitation in Constrained MDPs

    Authors: Yonathan Efroni, Shie Mannor, Matteo Pirotta

    Abstract: In many sequential decision-making problems, the goal is to optimize a utility function while satisfying a set of constraints on different utilities. This learning problem is formalized through Constrained Markov Decision Processes (CMDPs). In this paper, we investigate the exploration-exploitation dilemma in CMDPs. While learning in an unknown CMDP, an agent should trade off exploration to discov…

    Submitted 4 March, 2020; originally announced March 2020.

  28. arXiv:2002.03839  [pdf, other]

    cs.LG stat.ML

    Adversarial Attacks on Linear Contextual Bandits

    Authors: Evrard Garcelon, Baptiste Roziere, Laurent Meunier, Jean Tarbouriech, Olivier Teytaud, Alessandro Lazaric, Matteo Pirotta

    Abstract: Contextual bandit algorithms are applied in a wide range of domains, from advertising to recommender systems, from clinical trials to education. In many of these domains, malicious agents may have incentives to attack the bandit algorithm to induce it to perform a desired behavior. For instance, an unscrupulous ad publisher may try to increase their own revenue at the expense of the advertisers; a…

    Submitted 23 October, 2020; v1 submitted 10 February, 2020; originally announced February 2020.

  29. arXiv:2002.03221  [pdf, other]

    cs.LG stat.ML

    Improved Algorithms for Conservative Exploration in Bandits

    Authors: Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, Matteo Pirotta

    Abstract: In many fields such as digital marketing, healthcare, finance, and robotics, it is common to have a well-tested and reliable baseline policy running in production (e.g., a recommender system). Nonetheless, the baseline policy is often suboptimal. In this case, it is desirable to deploy online learning algorithms (e.g., a multi-armed bandit algorithm) that interact with the system to learn a better…

    Submitted 8 February, 2020; originally announced February 2020.

  30. arXiv:2002.03218  [pdf, other]

    cs.LG stat.ML

    Conservative Exploration in Reinforcement Learning

    Authors: Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, Matteo Pirotta

    Abstract: While learning in an unknown Markov Decision Process (MDP), an agent should trade off exploration to discover new information about the MDP, and exploitation of the current knowledge to maximize the reward. Although the agent will eventually learn a good or optimal policy, there is no guarantee on the quality of the intermediate policies. This lack of control is undesired in real-world application…

    Submitted 15 July, 2020; v1 submitted 8 February, 2020; originally announced February 2020.

    Comments: AISTATS 2020

  31. arXiv:2001.11595  [pdf, ps, other]

    cs.LG stat.ML

    Concentration Inequalities for Multinoulli Random Variables

    Authors: Jian Qian, Ronan Fruit, Matteo Pirotta, Alessandro Lazaric

    Abstract: We investigate concentration inequalities for Dirichlet and Multinomial random variables.

    Submitted 30 January, 2020; originally announced January 2020.

    Comments: Tutorial at ALT'19 on Regret Minimization in Infinite-Horizon Finite Markov Decision Processes

  32. arXiv:2001.04418  [pdf, other]

    cs.AI

    Exploiting Language Instructions for Interpretable and Compositional Reinforcement Learning

    Authors: Michiel van der Meer, Matteo Pirotta, Elia Bruni

    Abstract: In this work, we present an alternative approach to making an agent compositional through the use of a diagnostic classifier. Because of the need for explainable agents in automated decision processes, we attempt to interpret the latent space from an RL agent to identify its current objective in a complex language instruction. Results show that the classification process causes changes in the hidd…

    Submitted 13 January, 2020; originally announced January 2020.

    Comments: 10 pages, 5 figures

  33. arXiv:1912.03517  [pdf, other]

    stat.ML cs.LG

    No-Regret Exploration in Goal-Oriented Reinforcement Learning

    Authors: Jean Tarbouriech, Evrard Garcelon, Michal Valko, Matteo Pirotta, Alessandro Lazaric

    Abstract: Many popular reinforcement learning problems (e.g., navigation in a maze, some Atari games, mountain car) are instances of the episodic setting under its stochastic shortest path (SSP) formulation, where an agent has to achieve a goal state while minimizing the cumulative cost. Despite the popularity of this setting, the exploration-exploitation dilemma has been sparsely studied in general SSP pro…

    Submitted 17 August, 2020; v1 submitted 7 December, 2019; originally announced December 2019.

    Journal ref: International Conference on Machine Learning (ICML 2020)

  34. arXiv:1911.00567  [pdf, ps, other]

    cs.LG stat.ML

    Frequentist Regret Bounds for Randomized Least-Squares Value Iteration

    Authors: Andrea Zanette, David Brandfonbrener, Emma Brunskill, Matteo Pirotta, Alessandro Lazaric

    Abstract: We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning (RL). When the state space is large or continuous, traditional tabular approaches are unfeasible and some form of function approximation is mandatory. In this paper, we introduce an optimistically-initialized variant of the popular randomized least-squares value iteration (RLSVI), a model-free algorithm where…

    Submitted 8 September, 2023; v1 submitted 1 November, 2019; originally announced November 2019.

    Comments: Minor bug fixes

  35. arXiv:1905.03231  [pdf, other]

    cs.LG stat.ML

    Smoothing Policies and Safe Policy Gradients

    Authors: Matteo Papini, Matteo Pirotta, Marcello Restelli

    Abstract: Policy Gradient (PG) algorithms are among the best candidates for the much-anticipated applications of reinforcement learning to real-world control tasks, such as robotics. However, the trial-and-error nature of these methods poses safety issues whenever the learning process itself must be performed on a physical system or involves any form of human-computer interaction. In this paper, we address…

    Submitted 17 June, 2022; v1 submitted 8 May, 2019; originally announced May 2019.

  36. arXiv:1812.04363  [pdf, ps, other]

    cs.LG stat.ML

    Exploration Bonus for Regret Minimization in Undiscounted Discrete and Continuous Markov Decision Processes

    Authors: Jian Qian, Ronan Fruit, Matteo Pirotta, Alessandro Lazaric

    Abstract: We introduce and analyse two algorithms for exploration-exploitation in discrete and continuous Markov Decision Processes (MDPs) based on exploration bonuses. SCAL$^+$ is a variant of SCAL (Fruit et al., 2018) that performs efficient exploration-exploitation in any unknown weakly-communicating MDP for which an upper bound C on the span of the optimal bias function is known. For an MDP with $S$ sta…

    Submitted 11 December, 2018; originally announced December 2018.

  37. arXiv:1807.02373  [pdf, other]

    cs.LG stat.ML

    Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes

    Authors: Ronan Fruit, Matteo Pirotta, Alessandro Lazaric

    Abstract: While designing the state space of an MDP, it is common to include states that are transient or not reachable by any policy (e.g., in mountain car, the product space of speed and position contains configurations that are not physically reachable). This leads to defining weakly-communicating or multi-chain MDPs. In this paper, we introduce TUCRL, the first algorithm able to perform efficient explo…

    Submitted 20 March, 2019; v1 submitted 6 July, 2018; originally announced July 2018.

  38. arXiv:1806.05618  [pdf, other]

    cs.LG stat.ML

    Stochastic Variance-Reduced Policy Gradient

    Authors: Matteo Papini, Damiano Binaghi, Giuseppe Canonaco, Matteo Pirotta, Marcello Restelli

    Abstract: In this paper, we propose a novel reinforcement learning algorithm consisting of a stochastic variance-reduced version of policy gradient for solving Markov Decision Processes (MDPs). Stochastic variance-reduced gradient (SVRG) methods have proven to be very successful in supervised learning. However, their adaptation to policy gradient is not straightforward and needs to account for I) a non-con…

    Submitted 14 June, 2018; originally announced June 2018.

    Journal ref: Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018

  39. arXiv:1805.10886  [pdf, other]

    cs.LG stat.ML

    Importance Weighted Transfer of Samples in Reinforcement Learning

    Authors: Andrea Tirinzoni, Andrea Sessa, Matteo Pirotta, Marcello Restelli

    Abstract: We consider the transfer of experience samples (i.e., tuples $\langle s, a, s', r \rangle$) in reinforcement learning (RL), collected from a set of source tasks to improve the learning process in a given target task. Most of the related approaches focus on selecting the most relevant source samples for solving the target task, but then all the transferred samples are used without considering anymore the discrep…

    Submitted 28 May, 2018; originally announced May 2018.

    Comments: Accepted at ICML 2018

  40. arXiv:1802.04020  [pdf, other]

    cs.LG stat.ML

    Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning

    Authors: Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, Ronald Ortner

    Abstract: We introduce SCAL, an algorithm designed to perform efficient exploration-exploitation in any unknown weakly-communicating Markov decision process (MDP) for which an upper bound $c$ on the span of the optimal bias function is known. For an MDP with $S$ states, $A$ actions and $Γ\leq S$ possible next states, we prove a regret bound of $\widetilde{O}(c\sqrt{ΓSAT})$, which significantly improves over…

    Submitted 6 July, 2018; v1 submitted 12 February, 2018; originally announced February 2018.

  41. arXiv:1712.03428  [pdf, other]

    cs.LG stat.ML

    Cost-Sensitive Approach to Batch Size Adaptation for Gradient Descent

    Authors: Matteo Pirotta, Marcello Restelli

    Abstract: In this paper, we propose a novel approach to automatically determine the batch size in stochastic gradient descent methods. The choice of the batch size induces a trade-off between the accuracy of the gradient estimate and the cost in terms of samples of each update. We propose to determine the batch size by optimizing the ratio between a lower bound to a linear or quadratic Taylor approximation…

    Submitted 9 December, 2017; originally announced December 2017.

    Comments: Presented at the NIPS workshop on Optimizing the Optimizers. Barcelona, Spain, 2016

  42. arXiv:1406.3497  [pdf, other]

    cs.AI cs.LG

    Multi-objective Reinforcement Learning with Continuous Pareto Frontier Approximation Supplementary Material

    Authors: Matteo Pirotta, Simone Parisi, Marcello Restelli

    Abstract: This document contains supplementary material for the paper "Multi-objective Reinforcement Learning with Continuous Pareto Frontier Approximation", published at the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15). The paper is about learning a continuous approximation of the Pareto frontier in Multi-Objective Markov Decision Problems (MOMDPs). We propose a policy-based approach t…

    Submitted 18 November, 2014; v1 submitted 13 June, 2014; originally announced June 2014.

    Comments: AAAI-15 Supplement. Updated upon acceptance at the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15)