Search | arXiv e-print repository

Strategic Linear Contextual Bandits

Authors: Thomas Kleine Buening, Aadirupa Saha, Christos Dimitrakakis, Haifeng Xu

Abstract: Motivated by the phenomenon of strategic agents gaming a recommender system to maximize the number of times they are recommended to users, we study a strategic variant of the linear contextual bandit problem, where the arms can strategically misreport their privately observed contexts to the learner. We treat the algorithm design problem as one of mechanism design under uncertainty and propose the… ▽ More Motivated by the phenomenon of strategic agents gaming a recommender system to maximize the number of times they are recommended to users, we study a strategic variant of the linear contextual bandit problem, where the arms can strategically misreport their privately observed contexts to the learner. We treat the algorithm design problem as one of mechanism design under uncertainty and propose the Optimistic Grim Trigger Mechanism (OptGTM) that incentivizes the agents (i.e., arms) to report their contexts truthfully while simultaneously minimizing regret. We also show that failing to account for the strategic nature of the agents results in linear regret. However, a trade-off between mechanism design and regret minimization appears to be unavoidable. More broadly, this work aims to provide insight into the intersection of online learning and mechanism design. △ Less

Submitted 1 June, 2024; originally announced June 2024.

arXiv:2312.11663 [pdf, other]

Eliciting Kemeny Rankings

Authors: Anne-Marie George, Christos Dimitrakakis

Abstract: We formulate the problem of eliciting agents' preferences with the goal of finding a Kemeny ranking as a Dueling Bandits problem. Here the bandits' arms correspond to alternatives that need to be ranked and the feedback corresponds to a pairwise comparison between alternatives by a randomly sampled agent. We consider both sampling with and without replacement, i.e., the possibility to ask the same… ▽ More We formulate the problem of eliciting agents' preferences with the goal of finding a Kemeny ranking as a Dueling Bandits problem. Here the bandits' arms correspond to alternatives that need to be ranked and the feedback corresponds to a pairwise comparison between alternatives by a randomly sampled agent. We consider both sampling with and without replacement, i.e., the possibility to ask the same agent about some comparison multiple times or not. We find approximation bounds for Kemeny rankings dependant on confidence intervals over estimated winning probabilities of arms. Based on these we state algorithms to find Probably Approximately Correct (PAC) solutions and elaborate on their sample complexity for sampling with or without replacement. Furthermore, if all agents' preferences are strict rankings over the alternatives, we provide means to prune confidence intervals and thereby guide a more efficient elicitation. We formulate several adaptive sampling methods that use look-aheads to estimate how much confidence intervals (and thus approximation guarantees) might be tightened. All described methods are compared on synthetic data. △ Less

Submitted 18 December, 2023; originally announced December 2023.

Comments: This is a long version of the AAAI'24 publication under the same title

arXiv:2311.15647 [pdf, other]

Bandits Meet Mechanism Design to Combat Clickbait in Online Recommendation

Authors: Thomas Kleine Buening, Aadirupa Saha, Christos Dimitrakakis, Haifeng Xu

Abstract: We study a strategic variant of the multi-armed bandit problem, which we coin the strategic click-bandit. This model is motivated by applications in online recommendation where the choice of recommended items depends on both the click-through rates and the post-click rewards. Like in classical bandits, rewards follow a fixed unknown distribution. However, we assume that the click-rate of each arm… ▽ More We study a strategic variant of the multi-armed bandit problem, which we coin the strategic click-bandit. This model is motivated by applications in online recommendation where the choice of recommended items depends on both the click-through rates and the post-click rewards. Like in classical bandits, rewards follow a fixed unknown distribution. However, we assume that the click-rate of each arm is chosen strategically by the arm (e.g., a host on Airbnb) in order to maximize the number of times it gets clicked. The algorithm designer does not know the post-click rewards nor the arms' actions (i.e., strategically chosen click-rates) in advance, and must learn both values over time. To solve this problem, we design an incentive-aware learning algorithm, UCB-S, which achieves two goals simultaneously: (a) incentivizing desirable arm behavior under uncertainty; (b) minimizing regret by learning unknown parameters. We characterize all approximate Nash equilibria among arms under UCB-S and show a $\tilde{\mathcal{O}} (\sqrt{KT})$ regret bound uniformly in every equilibrium. We also show that incentive-unaware algorithms generally fail to achieve low regret in the strategic click-bandit. Finally, we support our theoretical results by simulations of strategic arm behavior which confirm the effectiveness and robustness of our proposed incentive design. △ Less

Submitted 27 November, 2023; originally announced November 2023.

arXiv:2302.10831 [pdf, other]

Minimax-Bayes Reinforcement Learning

Authors: Thomas Kleine Buening, Christos Dimitrakakis, Hannes Eriksson, Divya Grover, Emilio Jorge

Abstract: While the Bayesian decision-theoretic framework offers an elegant solution to the problem of decision making under uncertainty, one question is how to appropriately select the prior distribution. One idea is to employ a worst-case prior. However, this is not as easy to specify in sequential decision making as in simple statistical estimation problems. This paper studies (sometimes approximate) min… ▽ More While the Bayesian decision-theoretic framework offers an elegant solution to the problem of decision making under uncertainty, one question is how to appropriately select the prior distribution. One idea is to employ a worst-case prior. However, this is not as easy to specify in sequential decision making as in simple statistical estimation problems. This paper studies (sometimes approximate) minimax-Bayes solutions for various reinforcement learning problems to gain insights into the properties of the corresponding priors and policies. We find that while the worst-case prior depends on the setting, the corresponding minimax policies are more robust than those that assume a standard (i.e. uniform) prior. △ Less

Submitted 21 February, 2023; originally announced February 2023.

arXiv:2302.09273 [pdf, other]

Reinforcement Learning in the Wild with Maximum Likelihood-based Model Transfer

Authors: Hannes Eriksson, Debabrota Basu, Tommy Tram, Mina Alibeigi, Christos Dimitrakakis

Abstract: In this paper, we study the problem of transferring the available Markov Decision Process (MDP) models to learn and plan efficiently in an unknown but similar MDP. We refer to it as \textit{Model Transfer Reinforcement Learning (MTRL)} problem. First, we formulate MTRL for discrete MDPs and Linear Quadratic Regulators (LQRs) with continuous state actions. Then, we propose a generic two-stage algor… ▽ More In this paper, we study the problem of transferring the available Markov Decision Process (MDP) models to learn and plan efficiently in an unknown but similar MDP. We refer to it as \textit{Model Transfer Reinforcement Learning (MTRL)} problem. First, we formulate MTRL for discrete MDPs and Linear Quadratic Regulators (LQRs) with continuous state actions. Then, we propose a generic two-stage algorithm, MLEMTRL, to address the MTRL problem in discrete and continuous settings. In the first stage, MLEMTRL uses a \textit{constrained Maximum Likelihood Estimation (MLE)}-based approach to estimate the target MDP model using a set of known MDP models. In the second stage, using the estimated target MDP model, MLEMTRL deploys a model-based planning algorithm appropriate for the MDP class. Theoretically, we prove worst-case regret bounds for MLEMTRL both in realisable and non-realisable settings. We empirically demonstrate that MLEMTRL allows faster learning in new MDPs than learning from scratch and achieves near-optimal performance depending on the similarity of the available MDPs and the target MDP. △ Less

Submitted 18 February, 2023; originally announced February 2023.

Comments: 27 pages, 7 figures

arXiv:2210.14972 [pdf, other]

Environment Design for Inverse Reinforcement Learning

Authors: Thomas Kleine Buening, Victor Villin, Christos Dimitrakakis

Abstract: Learning a reward function from demonstrations suffers from low sample-efficiency. Even with abundant data, current inverse reinforcement learning methods that focus on learning from a single environment can fail to handle slight changes in the environment dynamics. We tackle these challenges through adaptive environment design. In our framework, the learner repeatedly interacts with the expert, w… ▽ More Learning a reward function from demonstrations suffers from low sample-efficiency. Even with abundant data, current inverse reinforcement learning methods that focus on learning from a single environment can fail to handle slight changes in the environment dynamics. We tackle these challenges through adaptive environment design. In our framework, the learner repeatedly interacts with the expert, with the former selecting environments to identify the reward function as quickly as possible from the expert's demonstrations in said environments. This results in improvements in both sample-efficiency and robustness, as we show experimentally, for both exact and approximate inference. △ Less

Submitted 14 May, 2024; v1 submitted 26 October, 2022; originally announced October 2022.

Comments: to appear at ICML 2024

arXiv:2203.10045 [pdf, other]

Risk-Sensitive Bayesian Games for Multi-Agent Reinforcement Learning under Policy Uncertainty

Authors: Hannes Eriksson, Debabrota Basu, Mina Alibeigi, Christos Dimitrakakis

Abstract: In stochastic games with incomplete information, the uncertainty is evoked by the lack of knowledge about a player's own and the other players' types, i.e. the utility function and the policy space, and also the inherent stochasticity of different players' interactions. In existing literature, the risk in stochastic games has been studied in terms of the inherent uncertainty evoked by the variabil… ▽ More In stochastic games with incomplete information, the uncertainty is evoked by the lack of knowledge about a player's own and the other players' types, i.e. the utility function and the policy space, and also the inherent stochasticity of different players' interactions. In existing literature, the risk in stochastic games has been studied in terms of the inherent uncertainty evoked by the variability of transitions and actions. In this work, we instead focus on the risk associated with the \textit{uncertainty over types}. We contrast this with the multi-agent reinforcement learning framework where the other agents have fixed stationary policies and investigate risk-sensitiveness due to the uncertainty about the other agents' adaptive policies. We propose risk-sensitive versions of existing algorithms proposed for risk-neutral stochastic games, such as Iterated Best Response (IBR), Fictitious Play (FP) and a general multi-objective gradient approach using dual ascent (DAPG). Our experimental analysis shows that risk-sensitive DAPG performs better than competing algorithms for both social welfare and general-sum stochastic games. △ Less

Submitted 18 March, 2022; originally announced March 2022.

Comments: 5 pages, 1 figure, 2 tables

arXiv:2111.04698 [pdf, other]

Interactive Inverse Reinforcement Learning for Cooperative Games

Authors: Thomas Kleine Buening, Anne-Marie George, Christos Dimitrakakis

Abstract: We study the problem of designing autonomous agents that can learn to cooperate effectively with a potentially suboptimal partner while having no access to the joint reward function. This problem is modeled as a cooperative episodic two-agent Markov decision process. We assume control over only the first of the two agents in a Stackelberg formulation of the game, where the second agent is acting s… ▽ More We study the problem of designing autonomous agents that can learn to cooperate effectively with a potentially suboptimal partner while having no access to the joint reward function. This problem is modeled as a cooperative episodic two-agent Markov decision process. We assume control over only the first of the two agents in a Stackelberg formulation of the game, where the second agent is acting so as to maximise expected utility given the first agent's policy. How should the first agent act in order to learn the joint reward function as quickly as possible and so that the joint policy is as close to optimal as possible? We analyse how knowledge about the reward function can be gained in this interactive two-agent scenario. We show that when the learning agent's policies have a significant effect on the transition function, the reward function can be learned efficiently. △ Less

Submitted 13 June, 2022; v1 submitted 8 November, 2021; originally announced November 2021.

Comments: ICML 2022

arXiv:2104.11834 [pdf, other]

High-dimensional near-optimal experiment design for drug discovery via Bayesian sparse sampling

Authors: Hannes Eriksson, Christos Dimitrakakis, Lars Carlsson

Abstract: We study the problem of performing automated experiment design for drug screening through Bayesian inference and optimisation. In particular, we compare and contrast the behaviour of linear-Gaussian models and Gaussian processes, when used in conjunction with upper confidence bound algorithms, Thompson sampling, or bounded horizon tree search. We show that non-myopic sophisticated exploration tech… ▽ More We study the problem of performing automated experiment design for drug screening through Bayesian inference and optimisation. In particular, we compare and contrast the behaviour of linear-Gaussian models and Gaussian processes, when used in conjunction with upper confidence bound algorithms, Thompson sampling, or bounded horizon tree search. We show that non-myopic sophisticated exploration techniques using sparse tree search have a distinct advantage over methods such as Thompson sampling or upper confidence bounds in this setting. We demonstrate the significant superiority of the approach over existing and synthetic datasets of drug toxicity. △ Less

Submitted 23 April, 2021; originally announced April 2021.

Comments: 14 pages, 6 figures

arXiv:2104.07276 [pdf, ps, other]

Adaptive Belief Discretization for POMDP Planning

Authors: Divya Grover, Christos Dimitrakakis

Abstract: Partially Observable Markov Decision Processes (POMDP) is a widely used model to represent the interaction of an environment and an agent, under state uncertainty. Since the agent does not observe the environment state, its uncertainty is typically represented through a probabilistic belief. While the set of possible beliefs is infinite, making exact planning intractable, the belief space's comple… ▽ More Partially Observable Markov Decision Processes (POMDP) is a widely used model to represent the interaction of an environment and an agent, under state uncertainty. Since the agent does not observe the environment state, its uncertainty is typically represented through a probabilistic belief. While the set of possible beliefs is infinite, making exact planning intractable, the belief space's complexity (and hence planning complexity) is characterized by its covering number. Many POMDP solvers uniformly discretize the belief space and give the planning error in terms of the (typically unknown) covering number. We instead propose an adaptive belief discretization scheme, and give its associated planning error. We furthermore characterize the covering number with respect to the POMDP parameters. This allows us to specify the exact memory requirements on the planner, needed to bound the value function error. We then propose a novel, computationally efficient solver using this scheme. We demonstrate that our algorithm is highly competitive with the state of the art in a variety of scenarios. △ Less

Submitted 15 April, 2021; originally announced April 2021.

arXiv:2102.11932 [pdf, other]

On Meritocracy in Optimal Set Selection

Authors: Thomas Kleine Buening, Meirav Segal, Debabrota Basu, Christos Dimitrakakis, Anne-Marie George

Abstract: Typically, merit is defined with respect to some intrinsic measure of worth. We instead consider a setting where an individual's worth is \emph{relative}: when a Decision Maker (DM) selects a set of individuals from a population to maximise expected utility, it is natural to consider the \emph{Expected Marginal Contribution} (EMC) of each person to the utility. We show that this notion satisfies a… ▽ More Typically, merit is defined with respect to some intrinsic measure of worth. We instead consider a setting where an individual's worth is \emph{relative}: when a Decision Maker (DM) selects a set of individuals from a population to maximise expected utility, it is natural to consider the \emph{Expected Marginal Contribution} (EMC) of each person to the utility. We show that this notion satisfies an axiomatic definition of fairness for this setting. We also show that for certain policy structures, this notion of fairness is aligned with maximising expected utility, while for linear utility functions it is identical to the Shapley value. However, for certain natural policies, such as those that select individuals with a specific set of attributes (e.g. high enough test scores for college admissions), there is a trade-off between meritocracy and utility maximisation. We analyse the effect of constraints on the policy on both utility and fairness in extensive experiments based on college admissions and outcomes in Norwegian universities. △ Less

Submitted 9 September, 2022; v1 submitted 23 February, 2021; originally announced February 2021.

Comments: EAAMO 2022

arXiv:2102.11075 [pdf, other]

SENTINEL: Taming Uncertainty with Ensemble-based Distributional Reinforcement Learning

Authors: Hannes Eriksson, Debabrota Basu, Mina Alibeigi, Christos Dimitrakakis

Abstract: In this paper, we consider risk-sensitive sequential decision-making in Reinforcement Learning (RL). Our contributions are two-fold. First, we introduce a novel and coherent quantification of risk, namely composite risk, which quantifies the joint effect of aleatory and epistemic risk during the learning process. Existing works considered either aleatory or epistemic risk individually, or as an ad… ▽ More In this paper, we consider risk-sensitive sequential decision-making in Reinforcement Learning (RL). Our contributions are two-fold. First, we introduce a novel and coherent quantification of risk, namely composite risk, which quantifies the joint effect of aleatory and epistemic risk during the learning process. Existing works considered either aleatory or epistemic risk individually, or as an additive combination. We prove that the additive formulation is a particular case of the composite risk when the epistemic risk measure is replaced with expectation. Thus, the composite risk is more sensitive to both aleatory and epistemic uncertainty than the individual and additive formulations. We also propose an algorithm, SENTINEL-K, based on ensemble bootstrap** and distributional RL for representing epistemic and aleatory uncertainty respectively. The ensemble of K learners uses Follow The Regularised Leader (FTRL) to aggregate the return distributions and obtain the composite risk. We experimentally verify that SENTINEL-K estimates the return distribution better, and while used with composite risk estimates, demonstrates higher risk-sensitive performance than state-of-the-art risk-sensitive and distributional RL algorithms. △ Less

Submitted 29 June, 2022; v1 submitted 22 February, 2021; originally announced February 2021.

Comments: 22 pages, 10 figures, 8 tables Accepted for UAI2022

arXiv:2002.03098 [pdf, other]

Inferential Induction: A Novel Framework for Bayesian Reinforcement Learning

Authors: Hannes Eriksson, Emilio Jorge, Christos Dimitrakakis, Debabrota Basu, Divya Grover

Abstract: Bayesian reinforcement learning (BRL) offers a decision-theoretic solution for reinforcement learning. While "model-based" BRL algorithms have focused either on maintaining a posterior distribution on models or value functions and combining this with approximate dynamic programming or tree search, previous Bayesian "model-free" value function distribution approaches implicitly make strong assumpti… ▽ More Bayesian reinforcement learning (BRL) offers a decision-theoretic solution for reinforcement learning. While "model-based" BRL algorithms have focused either on maintaining a posterior distribution on models or value functions and combining this with approximate dynamic programming or tree search, previous Bayesian "model-free" value function distribution approaches implicitly make strong assumptions or approximations. We describe a novel Bayesian framework, Inferential Induction, for correctly inferring value function distributions from data, which leads to the development of a new class of BRL algorithms. We design an algorithm, Bayesian Backwards Induction, with this framework. We experimentally demonstrate that the proposed algorithm is competitive with respect to the state of the art. △ Less

Submitted 1 July, 2020; v1 submitted 8 February, 2020; originally announced February 2020.

Comments: 28 pages, 12 figures

arXiv:1912.08523 [pdf, other]

Privacy of Real-Time Pricing in Smart Grid

Authors: Mahrokh GhoddousiBoroujeni, Dominik Fay, Christos Dimitrakakis, Maryam Kamgarpour

Abstract: Installing smart meters to publish real-time electricity rates has been controversial while it might lead to privacy concerns. Dispatched rates include fine-grained data on aggregate electricity consumption in a zone and could potentially be used to infer a household's pattern of energy use or its occupancy. In this paper, we propose Blowfish privacy to protect the occupancy state of the houses co… ▽ More Installing smart meters to publish real-time electricity rates has been controversial while it might lead to privacy concerns. Dispatched rates include fine-grained data on aggregate electricity consumption in a zone and could potentially be used to infer a household's pattern of energy use or its occupancy. In this paper, we propose Blowfish privacy to protect the occupancy state of the houses connected to a smart grid. First, we introduce a Markov model of the relationship between electricity rate and electricity consumption. Next, we develop an algorithm that perturbs electricity rates before publishing them to ensure users' privacy. Last, the proposed algorithm is tested on data inspired by household occupancy models and its performance is compared to an alternative solution. △ Less

Submitted 18 December, 2019; originally announced December 2019.

arXiv:1906.09114 [pdf, other]

Near-optimal Bayesian Solution For Unknown Discrete Markov Decision Process

Authors: Aristide Tossou, Christos Dimitrakakis, Debabrota Basu

Abstract: We tackle the problem of acting in an unknown finite and discrete Markov Decision Process (MDP) for which the expected shortest path from any state to any other state is bounded by a finite number $D$. An MDP consists of $S$ states and $A$ possible actions per state. Upon choosing an action $a_t$ at state $s_t$, one receives a real value reward $r_t$, then one transits to a next state $s_{t+1}$. T… ▽ More We tackle the problem of acting in an unknown finite and discrete Markov Decision Process (MDP) for which the expected shortest path from any state to any other state is bounded by a finite number $D$. An MDP consists of $S$ states and $A$ possible actions per state. Upon choosing an action $a_t$ at state $s_t$, one receives a real value reward $r_t$, then one transits to a next state $s_{t+1}$. The reward $r_t$ is generated from a fixed reward distribution depending only on $(s_t, a_t)$ and similarly, the next state $s_{t+1}$ is generated from a fixed transition distribution depending only on $(s_t, a_t)$. The objective is to maximize the accumulated rewards after $T$ interactions. In this paper, we consider the case where the reward distributions, the transitions, $T$ and $D$ are all unknown. We derive the first polynomial time Bayesian algorithm, BUCRL{} that achieves up to logarithm factors, a regret (i.e the difference between the accumulated rewards of the optimal policy and our algorithm) of the optimal order $\tilde{\mathcal{O}}(\sqrt{DSAT})$. Importantly, our result holds with high probability for the worst-case (frequentist) regret and not the weaker notion of Bayesian regret. We perform experiments in a variety of environments that demonstrate the superiority of our algorithm over previous techniques. Our work also illustrates several results that will be of independent interest. In particular, we derive a sharper upper bound for the KL-divergence of Bernoulli random variables. We also derive sharper upper and lower bounds for Beta and Binomial quantiles. All the bound are very simple and only use elementary functions. △ Less

Submitted 9 July, 2019; v1 submitted 20 June, 2019; originally announced June 2019.

Comments: Improved the text and added detailed proofs of claims Change title to better express the solution proposed

arXiv:1906.06273 [pdf, other]

Epistemic Risk-Sensitive Reinforcement Learning

Authors: Hannes Eriksson, Christos Dimitrakakis

Abstract: We develop a framework for interacting with uncertain environments in reinforcement learning (RL) by leveraging preferences in the form of utility functions. We claim that there is value in considering different risk measures during learning. In this framework, the preference for risk can be tuned by variation of the parameter $β$ and the resulting behavior can be risk-averse, risk-neutral or risk… ▽ More We develop a framework for interacting with uncertain environments in reinforcement learning (RL) by leveraging preferences in the form of utility functions. We claim that there is value in considering different risk measures during learning. In this framework, the preference for risk can be tuned by variation of the parameter $β$ and the resulting behavior can be risk-averse, risk-neutral or risk-taking depending on the parameter choice. We evaluate our framework for learning problems with model uncertainty. We measure and control for \emph{epistemic} risk using dynamic programming (DP) and policy gradient-based algorithms. The risk-averse behavior is then compared with the behavior of the optimal risk-neutral policy in environments with epistemic risk. △ Less

Submitted 14 June, 2019; originally announced June 2019.

Comments: 8 pages, 2 figures

Journal ref: Proceedings of the 28th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2020) 339-344

arXiv:1906.01609 [pdf, ps, other]

Near-Optimal Online Egalitarian learning in General Sum Repeated Matrix Games

Authors: Aristide Tossou, Christos Dimitrakakis, Jaroslaw Rzepecki, Katja Hofmann

Abstract: We study two-player general sum repeated finite games where the rewards of each player are generated from an unknown distribution. Our aim is to find the egalitarian bargaining solution (EBS) for the repeated game, which can lead to much higher rewards than the maximin value of both players. Our most important contribution is the derivation of an algorithm that achieves simultaneously, for both pl… ▽ More We study two-player general sum repeated finite games where the rewards of each player are generated from an unknown distribution. Our aim is to find the egalitarian bargaining solution (EBS) for the repeated game, which can lead to much higher rewards than the maximin value of both players. Our most important contribution is the derivation of an algorithm that achieves simultaneously, for both players, a high-probability regret bound of order $\mathcal{O}(\sqrt[3]{\ln T}\cdot T^{2/3})$ after any $T$ rounds of play. We demonstrate that our upper bound is nearly optimal by proving a lower bound of $Ω(T^{2/3})$ for any algorithm. △ Less

Submitted 4 June, 2019; originally announced June 2019.

arXiv:1905.12425 [pdf, other]

Near-optimal Optimistic Reinforcement Learning using Empirical Bernstein Inequalities

Authors: Aristide Tossou, Debabrota Basu, Christos Dimitrakakis

Abstract: We study model-based reinforcement learning in an unknown finite communicating Markov decision process. We propose a simple algorithm that leverages a variance based confidence interval. We show that the proposed algorithm, UCRL-V, achieves the optimal regret $\tilde{\mathcal{O}}(\sqrt{DSAT})$ up to logarithmic factors, and so our work closes a gap with the lower bound without additional assumptio… ▽ More We study model-based reinforcement learning in an unknown finite communicating Markov decision process. We propose a simple algorithm that leverages a variance based confidence interval. We show that the proposed algorithm, UCRL-V, achieves the optimal regret $\tilde{\mathcal{O}}(\sqrt{DSAT})$ up to logarithmic factors, and so our work closes a gap with the lower bound without additional assumptions on the MDP. We perform experiments in a variety of environments that validates the theoretical bounds as well as prove UCRL-V to be better than the state-of-the-art algorithms. △ Less

Submitted 11 December, 2019; v1 submitted 27 May, 2019; originally announced May 2019.

Comments: the algorithm has been simplified (no need to look at lower bound of the reward and transitions). Proof has been significantly clean-up. The previous "assumption" is clarified as a condition of the algorithm well-known as sub-modularity. The proof that the bounds satisfy the submodularity is clean-up

arXiv:1905.12298 [pdf, ps, other]

Differential Privacy for Multi-armed Bandits: What Is It and What Is Its Cost?

Authors: Debabrota Basu, Christos Dimitrakakis, Aristide Tossou

Abstract: Based on differential privacy (DP) framework, we introduce and unify privacy definitions for the multi-armed bandit algorithms. We represent the framework with a unified graphical model and use it to connect privacy definitions. We derive and contrast lower bounds on the regret of bandit algorithms satisfying these definitions. We leverage a unified proving technique to achieve all the lower bound… ▽ More Based on differential privacy (DP) framework, we introduce and unify privacy definitions for the multi-armed bandit algorithms. We represent the framework with a unified graphical model and use it to connect privacy definitions. We derive and contrast lower bounds on the regret of bandit algorithms satisfying these definitions. We leverage a unified proving technique to achieve all the lower bounds. We show that for all of them, the learner's regret is increased by a multiplicative factor dependent on the privacy level $ε$. We observe that the dependency is weaker when we do not require local differential privacy for the rewards. △ Less

Submitted 23 June, 2020; v1 submitted 29 May, 2019; originally announced May 2019.

Comments: 27 pages, 1 figure, 2 tables, 14 theorems

MSC Class: 62L10; 94A15

arXiv:1904.03535 [pdf, other]

Randomised Bayesian Least-Squares Policy Iteration

Authors: Nikolaos Tziortziotis, Christos Dimitrakakis, Michalis Vazirgiannis

Abstract: We introduce Bayesian least-squares policy iteration (BLSPI), an off-policy, model-free, policy iteration algorithm that uses the Bayesian least-squares temporal-difference (BLSTD) learning algorithm to evaluate policies. An online variant of BLSPI has been also proposed, called randomised BLSPI (RBLSPI), that improves its policy based on an incomplete policy evaluation step. In online setting, th… ▽ More We introduce Bayesian least-squares policy iteration (BLSPI), an off-policy, model-free, policy iteration algorithm that uses the Bayesian least-squares temporal-difference (BLSTD) learning algorithm to evaluate policies. An online variant of BLSPI has been also proposed, called randomised BLSPI (RBLSPI), that improves its policy based on an incomplete policy evaluation step. In online setting, the exploration-exploitation dilemma should be addressed as we try to discover the optimal policy by using samples collected by ourselves. RBLSPI exploits the advantage of BLSTD to quantify our uncertainty about the value function. Inspired by Thompson sampling, RBLSPI first samples a value function from a posterior distribution over value functions, and then selects actions based on the sampled value function. The effectiveness and the exploration abilities of RBLSPI are demonstrated experimentally in several environments. △ Less

Submitted 6 April, 2019; originally announced April 2019.

Comments: European Workshop on Reinforcement Learning 14, October 2018, Lille, France

arXiv:1902.02661 [pdf, other]

Bayesian Reinforcement Learning via Deep, Sparse Sampling

Authors: Divya Grover, Debabrota Basu, Christos Dimitrakakis

Abstract: We address the problem of Bayesian reinforcement learning using efficient model-based online planning. We propose an optimism-free Bayes-adaptive algorithm to induce deeper and sparser exploration with a theoretical bound on its performance relative to the Bayes optimal policy, with a lower computational complexity. The main novelty is the use of a candidate policy generator, to generate long-term… ▽ More We address the problem of Bayesian reinforcement learning using efficient model-based online planning. We propose an optimism-free Bayes-adaptive algorithm to induce deeper and sparser exploration with a theoretical bound on its performance relative to the Bayes optimal policy, with a lower computational complexity. The main novelty is the use of a candidate policy generator, to generate long-term options in the planning tree (over beliefs), which allows us to create much sparser and deeper trees. Experimental results on different environments show that in comparison to the state-of-the-art, our algorithm is both computationally more efficient, and obtains significantly higher reward in discrete environments. △ Less

Submitted 27 June, 2020; v1 submitted 7 February, 2019; originally announced February 2019.

Comments: Published in AISTATS 2020

arXiv:1902.00941 [pdf, other]

doi 10.1088/1361-6471/ab2b14

Bayesian optimization in ab initio nuclear physics

Authors: A. Ekström, C. Forssén, C. Dimitrakakis, D. Dubhashi, H. T. Johansson, A. S. Muhammad, H. Salomonsson, A. Schliep

Abstract: Theoretical models of the strong nuclear interaction contain unknown coupling constants (parameters) that must be determined using a pool of calibration data. In cases where the models are complex, leading to time consuming calculations, it is particularly challenging to systematically search the corresponding parameter domain for the best fit to the data. In this paper, we explore the prospect of… ▽ More Theoretical models of the strong nuclear interaction contain unknown coupling constants (parameters) that must be determined using a pool of calibration data. In cases where the models are complex, leading to time consuming calculations, it is particularly challenging to systematically search the corresponding parameter domain for the best fit to the data. In this paper, we explore the prospect of applying Bayesian optimization to constrain the coupling constants in chiral effective field theory descriptions of the nuclear interaction. We find that Bayesian optimization performs rather well with low-dimensional parameter domains and foresee that it can be particularly useful for optimization of a smaller set of coupling constants. A specific example could be the determination of leading three-nucleon forces using data from finite nuclei or three-nucleon scattering experiments. △ Less

Submitted 3 February, 2019; originally announced February 2019.

Comments: 33 pages, 14 figures

arXiv:1806.09192 [pdf, ps, other]

On The Differential Privacy of Thompson Sampling With Gaussian Prior

Authors: Aristide C. Y. Tossou, Christos Dimitrakakis

Abstract: We show that Thompson Sampling with Gaussian Prior as detailed by Algorithm 2 in (Agrawal & Goyal, 2013) is already differentially private. Theorem 1 show that it enjoys a very competitive privacy loss of only $\mathcal{O}(\ln^2 T)$ after T rounds. Finally, Theorem 2 show that one can control the privacy loss to any desirable $ε$ level by appropriately increasing the variance of the samples from t… ▽ More We show that Thompson Sampling with Gaussian Prior as detailed by Algorithm 2 in (Agrawal & Goyal, 2013) is already differentially private. Theorem 1 show that it enjoys a very competitive privacy loss of only $\mathcal{O}(\ln^2 T)$ after T rounds. Finally, Theorem 2 show that one can control the privacy loss to any desirable $ε$ level by appropriately increasing the variance of the samples from the Gaussian posterior. And this increases the regret only by a term of $\mathcal{O}(\frac{\ln^2 T}ε)$. This compares favorably to the previous result for Thompson Sampling in the literature ((Mishra & Thakurta, 2015)) which adds a term of $\mathcal{O}(\frac{K \ln^3 T}{ε^2})$ to the regret in order to achieve the same privacy level. Furthermore, our result use the basic Thompson Sampling with few modifications whereas the result of (Mishra & Thakurta, 2015) required sophisticated constructions. △ Less

Submitted 24 June, 2018; originally announced June 2018.

Comments: Accepted in Privacy in Machine Learning and Artificial Intelligence Workshop 2018

arXiv:1707.09678 [pdf, ps, other]

Learning to Match

Authors: Philip Ekman, Sebastian Bellevik, Christos Dimitrakakis, Aristide Tossou

Abstract: Outsourcing tasks to previously unknown parties is becoming more common. One specific such problem involves matching a set of workers to a set of tasks. Even if the latter have precise requirements, the quality of individual workers is usually unknown. The problem is thus a version of matching under uncertainty. We believe that this type of problem is going to be increasingly important. When the… ▽ More Outsourcing tasks to previously unknown parties is becoming more common. One specific such problem involves matching a set of workers to a set of tasks. Even if the latter have precise requirements, the quality of individual workers is usually unknown. The problem is thus a version of matching under uncertainty. We believe that this type of problem is going to be increasingly important. When the problem involves only a single skill or type of job, it is essentially a type of bandit problem, and can be solved with standard algorithms. However, we develop an algorithm that can perform matching for workers with multiple skills hired for multiple jobs with multiple requirements. We perform an experimental evaluation in both single-task and multi-task problems, comparing with the bounded $ε$-first algorithm, as well as an oracle that knows the true skills of workers. One of the algorithms we developed gives results approaching 85\% of oracle's performance. We invite the community to take a closer look at this problem and develop real-world benchmarks. △ Less

Submitted 30 July, 2017; originally announced July 2017.

Comments: 5 pages. This version will be presented at the VAMS Recsys workshop 2017

arXiv:1707.01875 [pdf, ps, other]

Calibrated Fairness in Bandits

Authors: Yang Liu, Goran Radanovic, Christos Dimitrakakis, Debmalya Mandal, David C. Parkes

Abstract: We study fairness within the stochastic, \emph{multi-armed bandit} (MAB) decision making framework. We adapt the fairness framework of "treating similar individuals similarly" to this setting. Here, an `individual' corresponds to an arm and two arms are `similar' if they have a similar quality distribution. First, we adopt a {\em smoothness constraint} that if two arms have a similar quality distr… ▽ More We study fairness within the stochastic, \emph{multi-armed bandit} (MAB) decision making framework. We adapt the fairness framework of "treating similar individuals similarly" to this setting. Here, an `individual' corresponds to an arm and two arms are `similar' if they have a similar quality distribution. First, we adopt a {\em smoothness constraint} that if two arms have a similar quality distribution then the probability of selecting each arm should be similar. In addition, we define the {\em fairness regret}, which corresponds to the degree to which an algorithm is not calibrated, where perfect calibration requires that the probability of selecting an arm is equal to the probability with which the arm has the best quality realization. We show that a variation on Thompson sampling satisfies smooth fairness for total variation distance, and give an $\tilde{O}((kT)^{2/3})$ bound on fairness regret. This complements prior work, which protects an on-average better arm from being less favored. We also explain how to extend our algorithm to the dueling bandit setting. △ Less

Submitted 6 July, 2017; originally announced July 2017.

Comments: To be presented at the FAT-ML'17 workshop

arXiv:1706.00119 [pdf, ps, other]

Bayesian fairness

Authors: Christos Dimitrakakis, Yang Liu, David Parkes, Goran Radanovic

Abstract: We consider the problem of how decision making can be fair when the underlying probabilistic model of the world is not known with certainty. We argue that recent notions of fairness in machine learning need to explicitly incorporate parameter uncertainty, hence we introduce the notion of {\em Bayesian fairness} as a suitable candidate for fair decision rules. Using balance, a definition of fairnes… ▽ More We consider the problem of how decision making can be fair when the underlying probabilistic model of the world is not known with certainty. We argue that recent notions of fairness in machine learning need to explicitly incorporate parameter uncertainty, hence we introduce the notion of {\em Bayesian fairness} as a suitable candidate for fair decision rules. Using balance, a definition of fairness introduced by Kleinberg et al (2016), we show how a Bayesian perspective can lead to well-performing, fair decision rules even under high uncertainty. △ Less

Submitted 4 November, 2018; v1 submitted 31 May, 2017; originally announced June 2017.

Comments: 13 pages, 8 figures, to appear at AAAI 2019

arXiv:1701.04238 [pdf, other]

Thompson Sampling For Stochastic Bandits with Graph Feedback

Authors: Aristide C. Y. Tossou, Christos Dimitrakakis, Devdatt Dubhashi

Abstract: We present a novel extension of Thompson Sampling for stochastic sequential decision problems with graph feedback, even when the graph structure itself is unknown and/or changing. We provide theoretical guarantees on the Bayesian regret of the algorithm, linking its performance to the underlying properties of the graph. Thompson Sampling has the advantage of being applicable without the need to co… ▽ More We present a novel extension of Thompson Sampling for stochastic sequential decision problems with graph feedback, even when the graph structure itself is unknown and/or changing. We provide theoretical guarantees on the Bayesian regret of the algorithm, linking its performance to the underlying properties of the graph. Thompson Sampling has the advantage of being applicable without the need to construct complicated upper confidence bounds for different problems. We illustrate its performance through extensive experimental results on real and simulated networks with graph feedback. More specifically, we tested our algorithms on power law, planted partitions and Erdo's-Renyi graphs, as well as on graphs derived from Facebook and Flixster data. These all show that our algorithms clearly outperform related methods that employ upper confidence bounds, even if the latter use more information about the graph. △ Less

Submitted 16 January, 2017; originally announced January 2017.

arXiv:1701.04222 [pdf, other]

Achieving Privacy in the Adversarial Multi-Armed Bandit

Authors: Aristide C. Y. Tossou, Christos Dimitrakakis

Abstract: In this paper, we improve the previously best known regret bound to achieve $ε$-differential privacy in oblivious adversarial bandits from $\mathcal{O}{(T^{2/3}/ε)}$ to $\mathcal{O}{(\sqrt{T} \ln T /ε)}$. This is achieved by combining a Laplace Mechanism with EXP3. We show that though EXP3 is already differentially private, it leaks a linear amount of information in $T$. However, we can improve th… ▽ More In this paper, we improve the previously best known regret bound to achieve $ε$-differential privacy in oblivious adversarial bandits from $\mathcal{O}{(T^{2/3}/ε)}$ to $\mathcal{O}{(\sqrt{T} \ln T /ε)}$. This is achieved by combining a Laplace Mechanism with EXP3. We show that though EXP3 is already differentially private, it leaks a linear amount of information in $T$. However, we can improve this privacy by relying on its intrinsic exponential mechanism for selecting actions. This allows us to reach $\mathcal{O}{(\sqrt{\ln T})}$-DP, with a regret of $\mathcal{O}{(T^{2/3})}$ that holds against an adaptive adversary, an improvement from the best known of $\mathcal{O}{(T^{3/4})}$. This is done by using an algorithm that run EXP3 in a mini-batch loop. Finally, we run experiments that clearly demonstrate the validity of our theoretical analysis. △ Less

Submitted 16 January, 2017; originally announced January 2017.

arXiv:1512.06992 [pdf, other]

On the Differential Privacy of Bayesian Inference

Authors: Zuhe Zhang, Benjamin Rubinstein, Christos Dimitrakakis

Abstract: We study how to communicate findings of Bayesian inference to third parties, while preserving the strong guarantee of differential privacy. Our main contributions are four different algorithms for private Bayesian inference on proba-bilistic graphical models. These include two mechanisms for adding noise to the Bayesian updates, either directly to the posterior parameters, or to their Fourier tran… ▽ More We study how to communicate findings of Bayesian inference to third parties, while preserving the strong guarantee of differential privacy. Our main contributions are four different algorithms for private Bayesian inference on proba-bilistic graphical models. These include two mechanisms for adding noise to the Bayesian updates, either directly to the posterior parameters, or to their Fourier transform so as to preserve update consistency. We also utilise a recently introduced posterior sampling mechanism, for which we prove bounds for the specific but general case of discrete Bayesian networks; and we introduce a maximum-a-posteriori private mechanism. Our analysis includes utility and privacy bounds, with a novel focus on the influence of graph structure on privacy. Worked examples and experiments with Bayesian na{ï}ve Bayes and Bayesian linear regression illustrate the application of our mechanisms. △ Less

Submitted 22 December, 2015; originally announced December 2015.

Comments: AAAI 2016, Feb 2016, Phoenix, Arizona, United States

arXiv:1511.08681 [pdf, ps, other]

Algorithms for Differentially Private Multi-Armed Bandits

Authors: Aristide Tossou, Christos Dimitrakakis

Abstract: We present differentially private algorithms for the stochastic Multi-Armed Bandit (MAB) problem. This is a problem for applications such as adaptive clinical trials, experiment design, and user-targeted advertising where private information is connected to individual rewards. Our major contribution is to show that there exist $(ε, δ)$ differentially private variants of Upper Confidence Bound algo… ▽ More We present differentially private algorithms for the stochastic Multi-Armed Bandit (MAB) problem. This is a problem for applications such as adaptive clinical trials, experiment design, and user-targeted advertising where private information is connected to individual rewards. Our major contribution is to show that there exist $(ε, δ)$ differentially private variants of Upper Confidence Bound algorithms which have optimal regret, $O(ε^{-1} + \log T)$. This is a significant improvement over previous results, which only achieve poly-log regret $O(ε^{-2} \log^{2} T)$, because of our use of a novel interval-based mechanism. We also substantially improve the bounds of previous family of algorithms which use a continual release mechanism. Experiments clearly validate our theoretical bounds. △ Less

Submitted 27 November, 2015; originally announced November 2015.

Journal ref: AAAI 2016, Feb 2016, Phoenix, Arizona, United States

arXiv:1412.3276 [pdf, ps, other]

Generalised Entropy MDPs and Minimax Regret

Authors: Emmanouil G. Androulakis, Christos Dimitrakakis

Abstract: Bayesian methods suffer from the problem of how to specify prior beliefs. One interesting idea is to consider worst-case priors. This requires solving a stochastic zero-sum game. In this paper, we extend well-known results from bandit theory in order to discover minimax-Bayes policies and discuss when they are practical. Bayesian methods suffer from the problem of how to specify prior beliefs. One interesting idea is to consider worst-case priors. This requires solving a stochastic zero-sum game. In this paper, we extend well-known results from bandit theory in order to discover minimax-Bayes policies and discuss when they are practical. △ Less

Submitted 10 December, 2014; originally announced December 2014.

Comments: 7 pages, NIPS workshop "From bad models to good policies"

arXiv:1408.2067 [pdf]

Probabilistic inverse reinforcement learning in unknown environments

Authors: Aristide Tossou, Christos Dimitrakakis

Abstract: We consider the problem of learning by demonstration from agents acting in unknown stochastic Markov environments or games. Our aim is to estimate agent preferences in order to construct improved policies for the same task that the agents are trying to solve. To do so, we extend previous probabilistic approaches for inverse reinforcement learning in known MDPs to the case of unknown dynamics or op… ▽ More We consider the problem of learning by demonstration from agents acting in unknown stochastic Markov environments or games. Our aim is to estimate agent preferences in order to construct improved policies for the same task that the agents are trying to solve. To do so, we extend previous probabilistic approaches for inverse reinforcement learning in known MDPs to the case of unknown dynamics or opponents. We do this by deriving two simplified probabilistic models of the demonstrator's policy and utility. For tractability, we use maximum a posteriori estimation rather than full Bayesian inference. Under a flat prior, this results in a convex optimisation problem. We find that the resulting algorithms are highly competitive against a variety of other methods for inverse reinforcement learning that do have knowledge of the dynamics. △ Less

Submitted 9 August, 2014; originally announced August 2014.

Comments: Appears in Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI2013)

Report number: UAI-P-2013-PG-635-643

arXiv:1307.3785 [pdf, ps, other]

Probabilistic inverse reinforcement learning in unknown environments

Authors: Aristide C. Y. Tossou, Christos Dimitrakakis

Abstract: We consider the problem of learning by demonstration from agents acting in unknown stochastic Markov environments or games. Our aim is to estimate agent preferences in order to construct improved policies for the same task that the agents are trying to solve. To do so, we extend previous probabilistic approaches for inverse reinforcement learning in known MDPs to the case of unknown dynamics or op… ▽ More We consider the problem of learning by demonstration from agents acting in unknown stochastic Markov environments or games. Our aim is to estimate agent preferences in order to construct improved policies for the same task that the agents are trying to solve. To do so, we extend previous probabilistic approaches for inverse reinforcement learning in known MDPs to the case of unknown dynamics or opponents. We do this by deriving two simplified probabilistic models of the demonstrator's policy and utility. For tractability, we use maximum a posteriori estimation rather than full Bayesian inference. Under a flat prior, this results in a convex optimisation problem. We find that the resulting algorithms are highly competitive against a variety of other methods for inverse reinforcement learning that do have knowledge of the dynamics. △ Less

Submitted 14 July, 2013; originally announced July 2013.

Comments: UAI 2013

arXiv:1306.1066 [pdf, other]

Bayesian Differential Privacy through Posterior Sampling

Authors: Christos Dimitrakakis, Blaine Nelson, and Zuhe Zhang, Aikaterini Mitrokotsa, Benjamin Rubinstein

Abstract: Differential privacy formalises privacy-preserving mechanisms that provide access to a database. We pose the question of whether Bayesian inference itself can be used directly to provide private access to data, with no modification. The answer is affirmative: under certain conditions on the prior, sampling from the posterior distribution can be used to achieve a desired level of privacy and utilit… ▽ More Differential privacy formalises privacy-preserving mechanisms that provide access to a database. We pose the question of whether Bayesian inference itself can be used directly to provide private access to data, with no modification. The answer is affirmative: under certain conditions on the prior, sampling from the posterior distribution can be used to achieve a desired level of privacy and utility. To do so, we generalise differential privacy to arbitrary dataset metrics, outcome spaces and distribution families. This allows us to also deal with non-i.i.d or non-tabular datasets. We prove bounds on the sensitivity of the posterior to the data, which gives a measure of robustness. We also show how to use posterior sampling to provide differentially private responses to queries, within a decision-theoretic framework. Finally, we provide bounds on the utility and on the distinguishability of datasets. The latter are complemented by a novel use of Le Cam's method to obtain lower bounds. All our general results hold for arbitrary database metrics, including those for the common definition of differential privacy. For specific choices of the metric, we give a number of examples satisfying our assumptions. △ Less

Submitted 23 December, 2016; v1 submitted 5 June, 2013; originally announced June 2013.

Comments: 38 pages; An earlier version of this article was published in ALT 2014. This version has corrections and additional results

arXiv:1305.1809 [pdf, other]

Cover Tree Bayesian Reinforcement Learning

Authors: Nikolaos Tziortziotis, Christos Dimitrakakis, Konstantinos Blekas

Abstract: This paper proposes an online tree-based Bayesian approach for reinforcement learning. For inference, we employ a generalised context tree model. This defines a distribution on multivariate Gaussian piecewise-linear models, which can be updated in closed form. The tree structure itself is constructed using the cover tree method, which remains efficient in high dimensional spaces. We combine the mo… ▽ More This paper proposes an online tree-based Bayesian approach for reinforcement learning. For inference, we employ a generalised context tree model. This defines a distribution on multivariate Gaussian piecewise-linear models, which can be updated in closed form. The tree structure itself is constructed using the cover tree method, which remains efficient in high dimensional spaces. We combine the model with Thompson sampling and approximate dynamic programming to obtain effective exploration policies in unknown environments. The flexibility and computational simplicity of the model render it suitable for many reinforcement learning problems in continuous state spaces. We demonstrate this in an experimental comparison with least squares policy iteration. △ Less

Submitted 2 May, 2014; v1 submitted 8 May, 2013; originally announced May 2013.

arXiv:1303.6977 [pdf, ps, other]

ABC Reinforcement Learning

Authors: Christos Dimitrakakis, Nikolaos Tziortziotis

Abstract: This paper introduces a simple, general framework for likelihood-free Bayesian reinforcement learning, through Approximate Bayesian Computation (ABC). The main advantage is that we only require a prior distribution on a class of simulators (generative models). This is useful in domains where an analytical probabilistic model of the underlying process is too complex to formulate, but where detailed… ▽ More This paper introduces a simple, general framework for likelihood-free Bayesian reinforcement learning, through Approximate Bayesian Computation (ABC). The main advantage is that we only require a prior distribution on a class of simulators (generative models). This is useful in domains where an analytical probabilistic model of the underlying process is too complex to formulate, but where detailed simulation models are available. ABC-RL allows the use of any Bayesian reinforcement learning technique, even in this case. In addition, it can be seen as an extension of rollout algorithms to the case where we do not know what the correct model to draw rollouts from is. We experimentally demonstrate the potential of this approach in a comparison with LSPI. Finally, we introduce a theorem showing that ABC is a sound methodology in principle, even when non-sufficient statistics are used. △ Less

Submitted 28 June, 2013; v1 submitted 27 March, 2013; originally announced March 2013.

Comments: Corrected version of paper appearing in ICML 2013

arXiv:1303.2506 [pdf, ps, other]

doi 10.1109/CDC.2013.6761048

Monte-Carlo utility estimates for Bayesian reinforcement learning

Authors: Christos Dimitrakakis

Abstract: This paper introduces a set of algorithms for Monte-Carlo Bayesian reinforcement learning. Firstly, Monte-Carlo estimation of upper bounds on the Bayes-optimal value function is employed to construct an optimistic policy. Secondly, gradient-based algorithms for approximate upper and lower bounds are introduced. Finally, we introduce a new class of gradient algorithms for Bayesian Bellman error min… ▽ More This paper introduces a set of algorithms for Monte-Carlo Bayesian reinforcement learning. Firstly, Monte-Carlo estimation of upper bounds on the Bayes-optimal value function is employed to construct an optimistic policy. Secondly, gradient-based algorithms for approximate upper and lower bounds are introduced. Finally, we introduce a new class of gradient algorithms for Bayesian Bellman error minimisation. We theoretically show that the gradient methods are sound. Experimentally, we demonstrate the superiority of the upper bound method in terms of reward obtained. However, we also show that the Bayesian Bellman error method is a close second, despite its significant computational simplicity. △ Less

Submitted 11 March, 2013; originally announced March 2013.

Comments: 6 pages, 4 figures, 1 table, submitted to IEEE conference on decision and control

arXiv:1303.0665 [pdf, other]

doi 10.1145/2507157.2507166

Personalized News Recommendation with Context Trees

Authors: Florent Garcin, Christos Dimitrakakis, Boi Faltings

Abstract: The profusion of online news articles makes it difficult to find interesting articles, a problem that can be assuaged by using a recommender system to bring the most relevant news stories to readers. However, news recommendation is challenging because the most relevant articles are often new content seen by few users. In addition, they are subject to trends and preference changes over time, and in… ▽ More The profusion of online news articles makes it difficult to find interesting articles, a problem that can be assuaged by using a recommender system to bring the most relevant news stories to readers. However, news recommendation is challenging because the most relevant articles are often new content seen by few users. In addition, they are subject to trends and preference changes over time, and in many cases we do not have sufficient information to profile the reader. In this paper, we introduce a class of news recommendation systems based on context trees. They can provide high-quality news recommendation to anonymous visitors based on present browsing behaviour. We show that context-tree recommender systems provide good prediction accuracy and recommendation novelty, and they are sufficiently flexible to capture the unique properties of news articles. △ Less

Submitted 3 November, 2014; v1 submitted 4 March, 2013; originally announced March 2013.

Journal ref: Proceedings of the 7th ACM conference on Recommender systems (2013), pp. 105--112

arXiv:1208.5641 [pdf, ps, other]

Near-Optimal Blacklisting

Authors: Christos Dimitrakakis, Aikaterini Mitrokotsa

Abstract: Many applications involve agents sharing a resource, such as networks or services. When agents are honest, the system functions well and there is a net profit. Unfortunately, some agents may be malicious, but it may be hard to detect them. We consider the intrusion response problem of how to permanently blacklist agents, in order to maximise expected profit. This is not trivial, as blacklisting ma… ▽ More Many applications involve agents sharing a resource, such as networks or services. When agents are honest, the system functions well and there is a net profit. Unfortunately, some agents may be malicious, but it may be hard to detect them. We consider the intrusion response problem of how to permanently blacklist agents, in order to maximise expected profit. This is not trivial, as blacklisting may erroneously expel honest agents. Conversely, while we gain information by allowing an agent to remain, we may incur a cost due to malicious behaviour. We present an efficient algorithm (HIPER) for making near-optimal decisions for this problem. Additionally, we derive three algorithms by reducing the problem to a Markov decision process (MDP). Theoretically, we show that HIPER is near-optimal. Experimentally, its performance is close to that of the full MDP solution, when the (stronger) requirements of the latter are met. △ Less

Submitted 29 July, 2013; v1 submitted 28 August, 2012; originally announced August 2012.

Comments: Submitted to INFOCOM 2014, 10 pages, 3 figures

arXiv:1201.2555 [pdf, ps, other]

Sparse Reward Processes

Authors: Christos Dimitrakakis

Abstract: We introduce a class of learning problems where the agent is presented with a series of tasks. Intuitively, if there is relation among those tasks, then the information gained during execution of one task has value for the execution of another task. Consequently, the agent is intrinsically motivated to explore its environment beyond the degree necessary to solve the current task it has at hand. We… ▽ More We introduce a class of learning problems where the agent is presented with a series of tasks. Intuitively, if there is relation among those tasks, then the information gained during execution of one task has value for the execution of another task. Consequently, the agent is intrinsically motivated to explore its environment beyond the degree necessary to solve the current task it has at hand. We develop a decision theoretic setting that generalises standard reinforcement learning tasks and captures this intuition. More precisely, we consider a multi-stage stochastic game between a learning agent and an opponent. We posit that the setting is a good model for the problem of life-long learning in uncertain environments, where while resources must be spent learning about currently important tasks, there is also the need to allocate effort towards learning about aspects of the world which are not relevant at the moment. This is due to the fact that unpredictable future events may lead to a change of priorities for the decision maker. Thus, in some sense, the model "explains" the necessity of curiosity. Apart from introducing the general formalism, the paper provides algorithms. These are evaluated experimentally in some exemplary domains. In addition, performance bounds are proven for some cases of this problem. △ Less

Submitted 5 September, 2012; v1 submitted 12 January, 2012; originally announced January 2012.

Comments: 14 pages, 4 figures

arXiv:1106.3655 [pdf, ps, other]

doi 10.1007/978-3-642-29946-9_27

Bayesian multitask inverse reinforcement learning

Authors: Christos Dimitrakakis, Constantin Rothkopf

Abstract: We generalise the problem of inverse reinforcement learning to multiple tasks, from multiple demonstrations. Each one may represent one expert trying to solve a different task, or as different experts trying to solve the same task. Our main contribution is to formalise the problem as statistical preference elicitation, via a number of structured priors, whose form captures our biases about the rel… ▽ More We generalise the problem of inverse reinforcement learning to multiple tasks, from multiple demonstrations. Each one may represent one expert trying to solve a different task, or as different experts trying to solve the same task. Our main contribution is to formalise the problem as statistical preference elicitation, via a number of structured priors, whose form captures our biases about the relatedness of different tasks or expert policies. In doing so, we introduce a prior on policy optimality, which is more natural to specify. We show that our framework allows us not only to learn to efficiently from multiple experts but to also effectively differentiate between the goals of each. Possible applications include analysing the intrinsic motivations of subjects in behavioural experiments and learning from multiple teachers. △ Less

Submitted 17 November, 2011; v1 submitted 18 June, 2011; originally announced June 2011.

Comments: Corrected version. 13 pages, 8 figures

MSC Class: 62C10; 91B08; 91B10 ACM Class: G.3

Journal ref: Recent Advances in Reinforcement Learning LNCS 7188, pp. 273-284, 2012

arXiv:1106.3651 [pdf, ps, other]

Robust Bayesian reinforcement learning through tight lower bounds

Authors: Christos Dimitrakakis

Abstract: In the Bayesian approach to sequential decision making, exact calculation of the (subjective) utility is intractable. This extends to most special cases of interest, such as reinforcement learning problems. While utility bounds are known to exist for this problem, so far none of them were particularly tight. In this paper, we show how to efficiently calculate a lower bound, which corresponds to th… ▽ More In the Bayesian approach to sequential decision making, exact calculation of the (subjective) utility is intractable. This extends to most special cases of interest, such as reinforcement learning problems. While utility bounds are known to exist for this problem, so far none of them were particularly tight. In this paper, we show how to efficiently calculate a lower bound, which corresponds to the utility of a near-optimal memoryless policy for the decision problem, which is generally different from both the Bayes-optimal policy and the policy which is optimal for the expected MDP under the current belief. We then show how these can be applied to obtain robust exploration policies in a Bayesian reinforcement learning setting. △ Less

Submitted 11 November, 2011; v1 submitted 18 June, 2011; originally announced June 2011.

Comments: Corrected version. 12 pages, 3 figures, 1 table

arXiv:1104.5687 [pdf, other]

Preference elicitation and inverse reinforcement learning

Authors: Constantin Rothkopf, Christos Dimitrakakis

Abstract: We state the problem of inverse reinforcement learning in terms of preference elicitation, resulting in a principled (Bayesian) statistical formulation. This generalises previous work on Bayesian inverse reinforcement learning and allows us to obtain a posterior distribution on the agent's preferences, policy and optionally, the obtained reward sequence, from observations. We examine the relation… ▽ More We state the problem of inverse reinforcement learning in terms of preference elicitation, resulting in a principled (Bayesian) statistical formulation. This generalises previous work on Bayesian inverse reinforcement learning and allows us to obtain a posterior distribution on the agent's preferences, policy and optionally, the obtained reward sequence, from observations. We examine the relation of the resulting approach to other statistical methods for inverse reinforcement learning via analysis and experimental results. We show that preferences can be determined accurately, even if the observed agent's policy is sub-optimal with respect to its own preferences. In that case, significantly improved policies with respect to the agent's preferences are obtained, compared to both other methods and to the performance of the demonstrated policy. △ Less

Submitted 29 June, 2011; v1 submitted 29 April, 2011; originally announced April 2011.

Comments: 13 pages, 4 figures; ECML 2011

Report number: EPFL-REPORT-165975

arXiv:1009.0278 [pdf, ps, other]

Expected loss analysis of thresholded authentication protocols in noisy conditions

Authors: Christos Dimitrakakis, Aikaterini Mitrokotsa, Serge Vaudenay

Abstract: A number of authentication protocols have been proposed recently, where at least some part of the authentication is performed during a phase, lasting $n$ rounds, with no error correction. This requires assigning an acceptable threshold for the number of detected errors. This paper describes a framework enabling an expected loss analysis for all the protocols in this family. Furthermore, computatio… ▽ More A number of authentication protocols have been proposed recently, where at least some part of the authentication is performed during a phase, lasting $n$ rounds, with no error correction. This requires assigning an acceptable threshold for the number of detected errors. This paper describes a framework enabling an expected loss analysis for all the protocols in this family. Furthermore, computationally simple methods to obtain nearly optimal value of the threshold, as well as for the number of rounds is suggested. Finally, a method to adaptively select both the number of rounds and the threshold is proposed. △ Less

Submitted 1 September, 2010; originally announced September 2010.

Comments: 17 pages, 2 figures; draft

arXiv:1005.2263 [pdf, other]

Context models on sequences of covers

Authors: Christos Dimitrakakis

Abstract: We present a class of models that, via a simple construction, enables exact, incremental, non-parametric, polynomial-time, Bayesian inference of conditional measures. The approach relies upon creating a sequence of covers on the conditioning variable and maintaining a different model for each set within a cover. Inference remains tractable by specifying the probabilistic model in terms of a random… ▽ More We present a class of models that, via a simple construction, enables exact, incremental, non-parametric, polynomial-time, Bayesian inference of conditional measures. The approach relies upon creating a sequence of covers on the conditioning variable and maintaining a different model for each set within a cover. Inference remains tractable by specifying the probabilistic model in terms of a random walk within the sequence of covers. We demonstrate the approach on problems of conditional density estimation, which, to our knowledge is the first closed-form, non-parametric Bayesian approach to this problem. △ Less

Submitted 30 May, 2011; v1 submitted 13 May, 2010; originally announced May 2010.

Comments: 14 pages, 2 figures

arXiv:0912.5029 [pdf, ps, other]

Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning

Authors: Christos Dimitrakakis

Abstract: There has been a lot of recent work on Bayesian methods for reinforcement learning exhibiting near-optimal online performance. The main obstacle facing such methods is that in most problems of interest, the optimal solution involves planning in an infinitely large tree. However, it is possible to obtain stochastic lower and upper bounds on the value of each tree node. This enables us to use stoc… ▽ More There has been a lot of recent work on Bayesian methods for reinforcement learning exhibiting near-optimal online performance. The main obstacle facing such methods is that in most problems of interest, the optimal solution involves planning in an infinitely large tree. However, it is possible to obtain stochastic lower and upper bounds on the value of each tree node. This enables us to use stochastic branch and bound algorithms to search the tree efficiently. This paper proposes two such algorithms and examines their complexity in this setting. △ Less

Submitted 26 December, 2009; originally announced December 2009.

Comments: 13 pages, 1 figure, ICAART 2010

Report number: TR-UVA-09-01

arXiv:0910.0483 [pdf, ps, other]

Statistical Decision Making for Authentication and Intrusion Detection

Authors: Christos Dimitrakakis, Aikaterini Mitrokotsa

Abstract: User authentication and intrusion detection differ from standard classification problems in that while we have data generated from legitimate users, impostor or intrusion data is scarce or non-existent. We review existing techniques for dealing with this problem and propose a novel alternative based on a principled statistical decision-making view point. We examine the technique on a toy problem… ▽ More User authentication and intrusion detection differ from standard classification problems in that while we have data generated from legitimate users, impostor or intrusion data is scarce or non-existent. We review existing techniques for dealing with this problem and propose a novel alternative based on a principled statistical decision-making view point. We examine the technique on a toy problem and validate it on complex real-world data from an RFID based access control system. The results indicate that it can significantly outperform the classical world model approach. The method could be more generally useful in other decision-making scenarios where there is a lack of adversary data. △ Less

Submitted 5 October, 2009; originally announced October 2009.

Comments: 13 pages, 2 figures, to be presented at ICMLA 2009

Report number: IAS-UVA-09-02

arXiv:0906.4618 [pdf, other]

Shedding Light on RFID Distance Bounding Protocols and Terrorist Fraud Attacks

Authors: Pedro Peris-Lopez, Julio C. Hernandez-Castro, Christos Dimitrakakis, Aikaterini Mitrokotsa, Juan M. E. Tapiador

Abstract: The vast majority of RFID authentication protocols assume the proximity between readers and tags due to the limited range of the radio channel. However, in real scenarios an intruder can be located between the prover (tag) and the verifier (reader) and trick this last one into thinking that the prover is in close proximity. This attack is generally known as a relay attack in which scope distance f… ▽ More The vast majority of RFID authentication protocols assume the proximity between readers and tags due to the limited range of the radio channel. However, in real scenarios an intruder can be located between the prover (tag) and the verifier (reader) and trick this last one into thinking that the prover is in close proximity. This attack is generally known as a relay attack in which scope distance fraud, mafia fraud and terrorist attacks are included. Distance bounding protocols represent a promising countermeasure to hinder relay attacks. Several protocols have been proposed during the last years but vulnerabilities of major or minor relevance have been identified in most of them. In 2008, Kim et al. [1] proposed a new distance bounding protocol with the objective of being the best in terms of security, privacy, tag computational overhead and fault tolerance. In this paper, we analyze this protocol and we present a passive full disclosure attack, which allows an adversary to discover the long-term secret key of the tag. The presented attack is very relevant, since no security objectives are met in Kim et al.'s protocol. Then, design guidelines are introduced with the aim of facilitating protocol designers the stimulating task of designing secure and efficient schemes against relay attacks. Finally a new protocol, named Hitomi and inspired by [1], is designed conforming the guidelines proposed previously. △ Less

Submitted 20 June, 2010; v1 submitted 25 June, 2009; originally announced June 2009.

Comments: 31 pages, 10 figures, 1 table

arXiv:0902.0392 [pdf, ps, other]

Tree Exploration for Bayesian RL Exploration

Authors: Christos Dimitrakakis

Abstract: Research in reinforcement learning has produced algorithms for optimal decision making under uncertainty that fall within two main types. The first employs a Bayesian framework, where optimality improves with increased computational time. This is because the resulting planning task takes the form of a dynamic programming problem on a belief tree with an infinite number of states. The second type e… ▽ More Research in reinforcement learning has produced algorithms for optimal decision making under uncertainty that fall within two main types. The first employs a Bayesian framework, where optimality improves with increased computational time. This is because the resulting planning task takes the form of a dynamic programming problem on a belief tree with an infinite number of states. The second type employs relatively simple algorithm which are shown to suffer small regret within a distribution-free framework. This paper presents a lower bound and a high probability upper bound on the optimal value function for the nodes in the Bayesian belief tree, which are analogous to similar bounds in POMDPs. The bounds are then used to create more efficient strategies for exploring the tree. The resulting algorithms are compared with the distribution-free algorithm UCB1, as well as a simpler baseline algorithm on multi-armed bandit problems. △ Less

Submitted 21 September, 2011; v1 submitted 2 February, 2009; originally announced February 2009.

Comments: 13 pages, 1 figure. Slightly extended and corrected version (notation errors and lower bound calculation) of homonymous paper presented at the conference of Computational Intelligence for Modelling, Control and Automation 2008 (CIMCA'08)

Report number: IAS-08-04

arXiv:0807.2043 [pdf]

Intrusion Detection Using Cost-Sensitive Classification

Authors: Aikaterini Mitrokotsa, Christos Dimitrakakis, Christos Douligeris

Abstract: Intrusion Detection is an invaluable part of computer networks defense. An important consideration is the fact that raising false alarms carries a significantly lower cost than not detecting at- tacks. For this reason, we examine how cost-sensitive classification methods can be used in Intrusion Detection systems. The performance of the approach is evaluated under different experimental conditio… ▽ More Intrusion Detection is an invaluable part of computer networks defense. An important consideration is the fact that raising false alarms carries a significantly lower cost than not detecting at- tacks. For this reason, we examine how cost-sensitive classification methods can be used in Intrusion Detection systems. The performance of the approach is evaluated under different experimental conditions, cost matrices and different classification models, in terms of expected cost, as well as detection and false alarm rates. We find that even under unfavourable conditions, cost-sensitive classification can improve performance significantly, if only slightly. △ Less

Submitted 13 July, 2008; originally announced July 2008.

Comments: 13 pages, 6 figures, presented at EC2ND 2007

Showing 1–50 of 54 results for author: Dimitrakakis, C