A stochastic game framework for patrolling a border

Authors: Matthew Darlington, Kevin D. Glazebrook, David S. Leslie, Rob Shone, Roberto Szechtman

Abstract: In this paper we consider a stochastic game for modelling the interactions between smugglers and a patroller along a border. The problem we examine involves a group of cooperating smugglers making regular attempts to bring small amounts of illicit goods across a border. A single patroller has the goal of preventing the smugglers from doing so, but must pay a cost to travel from one location to another. We model the problem as a two-player stochastic game and look to find the Nash equilibrium to gain insight into real-world problems. Our framework extends the literature by assuming that the smugglers choose a continuous quantity of contraband, complicating the analysis of the game. We discuss a number of properties of Nash equilibria, including the aggregation of smugglers, the discount factors of the players, and the equivalence to a zero-sum game. Additionally, we present algorithms to find Nash equilibria that are more computationally efficient than existing methods. We also consider certain assumptions on the parameters of the model that give interesting equilibrium strategies for the players.

Submitted 20 May, 2022; originally announced May 2022.

arXiv:2111.03340 [pdf, other]

doi 10.1145/3460231.3474607

FINN.no Slates Dataset: A new Sequential Dataset Logging Interactions, all Viewed Items and Click Responses/No-Click for Recommender Systems Research

Authors: Simen Eide, Arnoldo Frigessi, Helge Jenssen, David S. Leslie, Joakim Rishaug, Sofie Verrewaere

Abstract: We present a novel recommender systems dataset that records the sequential interactions between users and an online marketplace. The users are sequentially presented with both recommendations and search results in the form of ranked lists of items, called slates, from the marketplace. The dataset includes the presented slates at each round, whether the user clicked on any of these items and which item the user clicked on. Although the usage of exposure data in recommender systems is growing, to our knowledge there is no open large-scale recommender systems dataset that includes the slates of items presented to the users at each interaction. As a result, most articles on recommender systems do not utilize this exposure information. Instead, the proposed models only depend on the user's click responses, and assume that the user is exposed to all the items in the item universe at each step, often called uniform candidate sampling. This is an incomplete assumption, as it takes into account items the user might not have been exposed to. This way items might be incorrectly considered as not of interest to the user. Taking into account the actually shown slates allows the models to use a more natural likelihood, based on the click probability given the exposure set of items, as is prevalent in the bandit and reinforcement learning literature. Eide et al. (2021) show that likelihoods based on uniform candidate sampling (and similar assumptions) are implicitly assuming that the platform only shows the most relevant items to the user. This causes the recommender system to implicitly reinforce feedback loops and to be biased towards items previously exposed to the user.

Submitted 5 November, 2021; originally announced November 2021.

Comments: 5 pages, Fifteenth ACM Conference on Recommender Systems (recsys21), 2021, Amsterdam, Netherlands

arXiv:2109.14412 [pdf, other]

Apple Tasting Revisited: Bayesian Approaches to Partially Monitored Online Binary Classification

Authors: James A. Grant, David S. Leslie

Abstract: We consider a variant of online binary classification where a learner sequentially assigns labels ($0$ or $1$) to items with unknown true class. If, but only if, the learner chooses label $1$ they immediately observe the true label of the item. The learner faces a trade-off between short-term classification accuracy and long-term information gain. This problem has previously been studied under the name of the 'apple tasting' problem. We revisit this problem as a partial monitoring problem with side information, and focus on the case where item features are linked to true classes via a logistic regression model. Our principal contribution is a study of the performance of Thompson Sampling (TS) for this problem. Using recently developed information-theoretic tools, we show that TS achieves a Bayesian regret bound of improved order relative to previous approaches. Further, we experimentally verify that efficient approximations to TS and Information Directed Sampling via Pólya-Gamma augmentation have superior empirical performance to existing methods.

Submitted 22 April, 2024; v1 submitted 29 September, 2021; originally announced September 2021.

Comments: Update to Theorem 1 and experimental work
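
The feedback structure described in this abstract can be illustrated with a short simulation. The sketch below is not the paper's method: it drops the item features and the logistic regression model, and instead assumes every item is class 1 with a single unknown probability, learned with a Beta-Bernoulli form of Thompson Sampling. The function name `apple_tasting_ts` and its parameters are hypothetical.

```python
import random


def apple_tasting_ts(true_p, rounds, seed=0):
    """Context-free apple tasting with Beta-Bernoulli Thompson Sampling.

    Assumption (not from the paper): each item is class 1 independently
    with unknown probability `true_p`, and there are no item features.
    """
    rng = random.Random(seed)
    a, b = 1.0, 1.0  # Beta(1, 1) prior on P(class == 1)
    mistakes = 0
    observed = 0
    for _ in range(rounds):
        truth = 1 if rng.random() < true_p else 0
        # Sample a plausible class-1 probability and act greedily on it.
        sampled_p = rng.betavariate(a, b)
        label = 1 if sampled_p >= 0.5 else 0
        mistakes += int(label != truth)
        # The true class is revealed only when the learner chooses label 1.
        if label == 1:
            observed += 1
            a += truth
            b += 1 - truth
    return mistakes, observed, (a, b)


mistakes, observed, posterior = apple_tasting_ts(true_p=0.8, rounds=200)
```

The posterior is updated only on rounds where label 1 was chosen, which is the trade-off the abstract refers to: choosing label 0 may be safer in the short term but yields no information.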

arXiv:2106.02748 [pdf, other]

Decentralized Q-Learning in Zero-sum Markov Games

Authors: Muhammed O. Sayin, Kaiqing Zhang, David S. Leslie, Tamer Basar, Asuman Ozdaglar

Abstract: We study multi-agent reinforcement learning (MARL) in infinite-horizon discounted zero-sum Markov games. We focus on the practical but challenging setting of decentralized MARL, where agents make decisions without coordination by a centralized controller, but only based on their own payoffs and local actions executed. The agents need not observe the opponent's actions or payoffs, possibly being even oblivious to the presence of the opponent, nor be aware of the zero-sum structure of the underlying game, a setting also referred to as radically uncoupled in the literature of learning in games. In this paper, we develop a radically uncoupled Q-learning dynamics that is both rational and convergent: the learning dynamics converges to the best response to the opponent's strategy when the opponent follows an asymptotically stationary strategy; when both agents adopt the learning dynamics, they converge to the Nash equilibrium of the game. The key challenge in this decentralized setting is the non-stationarity of the environment from an agent's perspective, since both her own payoffs and the system evolution depend on the actions of other agents, and each agent adapts her policies simultaneously and independently. To address this issue, we develop a two-timescale learning dynamics where each agent updates her local Q-function and value function estimates concurrently, with the latter happening at a slower timescale.

Submitted 12 December, 2021; v1 submitted 4 June, 2021; originally announced June 2021.

Comments: To appear at NeurIPS 2021. Strengthened the results in Theorem 1 and Corollary 1

arXiv:2104.15046 [pdf, other]

Dynamic Slate Recommendation with Gated Recurrent Units and Thompson Sampling

Authors: Simen Eide, David S. Leslie, Arnoldo Frigessi

Abstract: We consider the problem of recommending relevant content to users of an internet platform in the form of lists of items, called slates. We introduce a variational Bayesian Recurrent Neural Net recommender system that acts on time series of interactions between the internet platform and the user, and which scales to real world industrial situations. The recommender system is tested both online on real users, and on an offline dataset collected from a Norwegian web-based marketplace, FINN.no, that is made public for research. This is one of the first publicly available datasets which includes all the slates that are presented to users as well as which items (if any) in the slates were clicked on. Such a data set allows us to move beyond the common implicit assumption that users are considering all possible items at each interaction. Instead we build our likelihood using the items that are actually in the slate, and evaluate the strengths and weaknesses of both approaches theoretically and in experiments. We also introduce a hierarchical prior for the item parameters based on group memberships. Both item parameters and user preferences are learned probabilistically. Furthermore, we combine our model with bandit strategies to ensure learning, and introduce 'in-slate Thompson Sampling' which makes use of the slates to maximise explorative opportunities. We show experimentally that explorative recommender strategies perform on par or above their greedy counterparts. Even without making use of exploration to learn more effectively, click rates increase simply because of improved diversity in the recommended slates.

Submitted 30 April, 2021; originally announced April 2021.

Comments: The code and the data used in the article are available in the following repository: https://github.com/finn-no/recsys-slates-dataset

arXiv:2102.03324 [pdf, other]

GIBBON: General-purpose Information-Based Bayesian OptimisatioN

Authors: Henry B. Moss, David S. Leslie, Javier Gonzalez, Paul Rayson

Abstract: This paper describes a general-purpose extension of max-value entropy search, a popular approach for Bayesian Optimisation (BO). A novel approximation is proposed for the information gain -- an information-theoretic quantity central to solving a range of BO problems, including noisy, multi-fidelity and batch optimisations across both continuous and highly-structured discrete spaces. Previously, these problems have been tackled separately within information-theoretic BO, each requiring a different sophisticated approximation scheme, except for batch BO, for which no computationally-lightweight information-theoretic approach has previously been proposed. GIBBON (General-purpose Information-Based Bayesian OptimisatioN) provides a single principled framework suitable for all the above, out-performing existing approaches whilst incurring substantially lower computational overheads. In addition, GIBBON does not require the problem's search space to be Euclidean and so is the first high-performance yet computationally light-weight acquisition function that supports batch BO over general highly structured input spaces like molecular search and gene design. Moreover, our principled derivation of GIBBON yields a natural interpretation of a popular batch BO heuristic based on determinantal point processes. Finally, we analyse GIBBON across a suite of synthetic benchmark tasks, a molecular search loop, and as part of a challenging batch multi-fidelity framework for problems with controllable experimental noise.

Submitted 26 October, 2021; v1 submitted 5 February, 2021; originally announced February 2021.

Journal ref: Journal of Machine Learning Research 2021

arXiv:2010.00979 [pdf, other]

BOSS: Bayesian Optimization over String Spaces

Authors: Henry B. Moss, Daniel Beck, Javier Gonzalez, David S. Leslie, Paul Rayson

Abstract: This article develops a Bayesian optimization (BO) method which acts directly over raw strings, proposing the first uses of string kernels and genetic algorithms within BO loops. Recent applications of BO over strings have been hindered by the need to map inputs into a smooth and unconstrained latent space. Learning this projection is computationally and data-intensive. Our approach instead builds a powerful Gaussian process surrogate model based on string kernels, naturally supporting variable length inputs, and performs efficient acquisition function maximization for spaces with syntactical constraints. Experiments demonstrate considerably improved optimization over existing approaches across a broad range of constraints, including the popular setting where syntax is governed by a context-free grammar.

Submitted 2 October, 2020; originally announced October 2020.

arXiv:2009.03207 [pdf, other]

Learning to Rank under Multinomial Logit Choice

Authors: James A. Grant, David S. Leslie

Abstract: Learning the optimal ordering of content is an important challenge in website design. The learning to rank (LTR) framework models this problem as a sequential problem of selecting lists of content and observing where users decide to click. Most previous work on LTR assumes that the user considers each item in the list in isolation, and makes binary choices to click or not on each. We introduce a multinomial logit (MNL) choice model to the LTR framework, which captures the behaviour of users who consider the ordered list of items as a whole and make a single choice among all the items and a no-click option. Under the MNL model, the user favours items which are either inherently more attractive, or placed in a preferable position within the list. We propose upper confidence bound (UCB) algorithms to minimise regret in two settings - where the position dependent parameters are known, and unknown. We present theoretical analysis leading to an $\Omega(\sqrt{JT})$ lower bound for the problem, an $\tilde{O}(\sqrt{JT})$ upper bound on regret of the UCB algorithm in the known-parameter setting, and an $\tilde{O}(K^2\sqrt{JT})$ upper bound on regret, the first, in the more challenging unknown-position-parameter setting. Our analyses are based on tight new concentration results for Geometric random variables, and novel functional inequalities for maximum likelihood estimators computed on discrete data.

Submitted 11 May, 2023; v1 submitted 7 September, 2020; originally announced September 2020.

Comments: updated with new material including regret bound for unknown position bias setting
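
To make the single-choice behaviour in this abstract concrete, the sketch below computes click probabilities under one common multinomial logit parameterisation. The exact form is an assumption rather than something stated in the abstract: each displayed item gets weight (attractiveness × position bias), and the no-click option gets weight 1. The function name `mnl_click_probabilities` and both parameter names are hypothetical.

```python
def mnl_click_probabilities(attractiveness, position_bias):
    """Click probabilities for an ordered list under an assumed MNL form.

    attractiveness[k] is the (assumed) attractiveness of the item shown in
    position k; position_bias[k] is the (assumed) multiplier for position k.
    Returns (per-position click probabilities, no-click probability).
    """
    if len(attractiveness) != len(position_bias):
        raise ValueError("need one position bias per displayed item")
    weights = [a * b for a, b in zip(attractiveness, position_bias)]
    denominator = 1.0 + sum(weights)  # the 1.0 is the no-click option
    clicks = [w / denominator for w in weights]
    return clicks, 1.0 / denominator


# Two equally attractive items; the second position is half as visible.
clicks, no_click = mnl_click_probabilities([0.5, 0.5], [1.0, 0.5])
```

Under this form the user makes exactly one choice per list, so the click probabilities and the no-click probability sum to one, and an item gains probability either by being more attractive or by being placed in a better position.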

arXiv:2007.00939 [pdf, other]

BOSH: Bayesian Optimization by Sampling Hierarchically

Authors: Henry B. Moss, David S. Leslie, Paul Rayson

Abstract: Deployments of Bayesian Optimization (BO) for functions with stochastic evaluations, such as parameter tuning via cross validation and simulation optimization, typically optimize an average of a fixed set of noisy realizations of the objective function. However, disregarding the true objective function in this manner finds a high-precision optimum of the wrong function. To solve this problem, we propose Bayesian Optimization by Sampling Hierarchically (BOSH), a novel BO routine pairing a hierarchical Gaussian process with an information-theoretic framework to generate a growing pool of realizations as the optimization progresses. We demonstrate that BOSH provides more efficient and higher-precision optimization than standard BO across synthetic benchmarks, simulation optimization, reinforcement learning and hyper-parameter tuning tasks.

Submitted 2 July, 2020; originally announced July 2020.

arXiv:2006.12093 [pdf, other]

MUMBO: MUlti-task Max-value Bayesian Optimization

Authors: Henry B. Moss, David S. Leslie, Paul Rayson

Abstract: We propose MUMBO, the first high-performing yet computationally efficient acquisition function for multi-task Bayesian optimization. Here, the challenge is to perform efficient optimization by evaluating low-cost functions somehow related to our true target function. This is a broad class of problems including the popular task of multi-fidelity optimization. However, while information-theoretic acquisition functions are known to provide state-of-the-art Bayesian optimization, existing implementations for multi-task scenarios have prohibitive computational requirements. Previous acquisition functions have therefore been suitable only for problems with both low-dimensional parameter spaces and function query costs sufficiently large to overshadow very significant optimization overheads. In this work, we derive a novel multi-task version of entropy search, delivering robust performance with low computational overheads across classic optimization challenges and multi-task hyper-parameter tuning. MUMBO is scalable and efficient, allowing multi-task Bayesian optimization to be deployed in problems with rich parameter and fidelity spaces.

Submitted 22 June, 2020; originally announced June 2020.

arXiv:2001.02323 [pdf, other]

On Thompson Sampling for Smoother-than-Lipschitz Bandits

Authors: James A. Grant, David S. Leslie

Abstract: Thompson Sampling is a well-established approach to bandit and reinforcement learning problems. However, its use in continuum armed bandit problems has received relatively little attention. We provide the first bounds on the regret of Thompson Sampling for continuum armed bandits under weak conditions on the function class containing the true function and sub-exponential observation noise. Our bounds are realised by analysis of the eluder dimension, a recently proposed measure of the complexity of a function class, which has been demonstrated to be useful in bounding the Bayesian regret of Thompson Sampling for simpler bandit problems under sub-Gaussian observation noise. We derive a new bound on the eluder dimension for classes of functions with Lipschitz derivatives, and generalise previous analyses in multiple regards.

Submitted 26 February, 2020; v1 submitted 7 January, 2020; originally announced January 2020.

Comments: Accepted to AISTATS 2020. 26 pages, 2 figures

arXiv:1906.12230 [pdf, other]

FIESTA: Fast IdEntification of State-of-The-Art models using adaptive bandit algorithms

Authors: Henry B. Moss, Andrew Moore, David S. Leslie, Paul Rayson

Abstract: We present FIESTA, a model selection approach that significantly reduces the computational resources required to reliably identify state-of-the-art performance from large collections of candidate models. Despite being known to produce unreliable comparisons, it is still common practice to compare model evaluations based on single choices of random seeds. We show that reliable model selection also requires evaluations based on multiple train-test splits (contrary to common practice in many shared tasks). Using bandit theory from the statistics literature, we are able to adaptively determine appropriate numbers of data splits and random seeds used to evaluate each model, focusing computational resources on the evaluation of promising models whilst avoiding wasting evaluations on models with lower performance. Furthermore, our user-friendly Python implementation produces confidence guarantees of correctly selecting the optimal model. We evaluate our algorithms by selecting between 8 target-dependent sentiment analysis methods using dramatically fewer model evaluations than current model selection approaches.

Submitted 28 June, 2019; originally announced June 2019.

Comments: ACL 2019. Code available at: https://github.com/apmoore1/fiesta

arXiv:1905.06821 [pdf, other]

Adaptive Sensor Placement for Continuous Spaces

Authors: James A Grant, Alexis Boukouvalas, Ryan-Rhys Griffiths, David S Leslie, Sattar Vakili, Enrique Munoz de Cote

Abstract: We consider the problem of adaptively placing sensors along an interval to detect stochastically-generated events. We present a new formulation of the problem as a continuum-armed bandit problem with feedback in the form of partial observations of realisations of an inhomogeneous Poisson process. We design a solution method by combining Thompson sampling with nonparametric inference via increasingly granular Bayesian histograms and derive an $\tilde{O}(T^{2/3})$ bound on the Bayesian regret in $T$ rounds. This is coupled with the design of an efficient optimisation approach to select actions in polynomial time. In simulations we demonstrate our approach to have substantially lower and less variable regret than competitor algorithms.

Submitted 16 May, 2019; originally announced May 2019.

Comments: 13 pages, accepted to ICML 2019

arXiv:1810.02176 [pdf, other]

Adaptive Policies for Perimeter Surveillance Problems

Authors: James A. Grant, David S. Leslie, Kevin Glazebrook, Roberto Szechtman, Adam N. Letchford

Abstract: Maximising the detection of intrusions is a fundamental and often critical aim of perimeter surveillance. Commonly, this requires a decision-maker to optimally allocate multiple searchers to segments of the perimeter. We consider a scenario where the decision-maker may sequentially update the searchers' allocation, learning from the observed data to improve decisions over time. In this work we propose a formal model and solution methods for this sequential perimeter surveillance problem. Our model is a combinatorial multi-armed bandit (CMAB) with Poisson rewards and a novel filtered feedback mechanism - arising from the failure to detect certain intrusions. Our solution method is an upper confidence bound approach and we derive upper and lower bounds on its expected performance. We prove that the gap between these bounds is of constant order, and demonstrate empirically that our approach is more reliable in simulated problems than competing algorithms.

Submitted 11 November, 2019; v1 submitted 4 October, 2018; originally announced October 2018.
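
The filtered feedback mechanism in this abstract can be sketched as a two-stage draw: intrusions appear according to a Poisson distribution, and the searcher only sees those it detects. The sketch below is illustrative only and assumes a specific filter, independent detection of each intrusion with a fixed per-segment probability (binomial thinning), which the abstract does not specify. The names `sample_poisson`, `filtered_round`, `rates` and `detect_probs` are hypothetical.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw a Poisson count with Knuth's multiplication method (small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def filtered_round(rates, detect_probs, chosen_segments, rng):
    """Simulate one round; return {segment: (appeared, detected)}.

    Assumption: each intrusion is detected independently with probability
    detect_probs[segment]. A learner would only see the `detected` count.
    """
    feedback = {}
    for segment in chosen_segments:
        appeared = sample_poisson(rates[segment], rng)
        detected = sum(rng.random() < detect_probs[segment] for _ in range(appeared))
        feedback[segment] = (appeared, detected)
    return feedback


rng = random.Random(0)
round_feedback = filtered_round(
    rates=[2.0, 0.5, 1.0],
    detect_probs=[0.9, 0.6, 0.3],
    chosen_segments=[0, 2],
    rng=rng,
)
```

Because the detected count can never exceed the number of intrusions that appeared, estimates built from it alone understate the true rate unless the filter is accounted for.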

arXiv:1810.01925 [pdf, ps, other]

Bandit learning in concave $N$-person games

Authors: Mario Bravo, David S. Leslie, Panayotis Mertikopoulos

Abstract: This paper examines the long-run behavior of learning with bandit feedback in non-cooperative concave games. The bandit framework accounts for extremely low-information environments where the agents may not even know they are playing a game; as such, the agents' most sensible choice in this setting would be to employ a no-regret learning algorithm. In general, this does not mean that the players' behavior stabilizes in the long run: no-regret learning may lead to cycles, even with perfect gradient information. However, if a standard monotonicity condition is satisfied, our analysis shows that no-regret learning based on mirror descent with bandit feedback converges to Nash equilibrium with probability $1$. We also derive an upper bound for the convergence rate of the process that nearly matches the best attainable rate for single-agent bandit stochastic optimization.

Submitted 3 October, 2018; originally announced October 2018.

Comments: 24 pages, 1 figure

MSC Class: Primary 91A10; 91A26; secondary 68Q32; 68T02

arXiv:1806.07139 [pdf, other]

Using J-K fold Cross Validation to Reduce Variance When Tuning NLP Models

Authors: Henry B. Moss, David S. Leslie, Paul Rayson

Abstract: K-fold cross validation (CV) is a popular method for estimating the true performance of machine learning models, allowing model selection and parameter tuning. However, the very process of CV requires random partitioning of the data and so our performance estimates are in fact stochastic, with variability that can be substantial for natural language processing tasks. We demonstrate that these unstable estimates cannot be relied upon for effective parameter tuning. The resulting tuned parameters are highly sensitive to how our data is partitioned, meaning that we often select sub-optimal parameter choices and have serious reproducibility issues. Instead, we propose to use the less variable J-K-fold CV, in which J independent K-fold cross validations are used to assess performance. Our main contributions are extending J-K-fold CV from performance estimation to parameter tuning and investigating how to choose J and K. We argue that variability is more important than bias for effective tuning and so advocate lower choices of K than are typically seen in the NLP literature, instead using the saved computation to increase J. To demonstrate the generality of our recommendations we investigate a wide range of case-studies: sentiment classification (both general and target-specific), part-of-speech tagging and document classification.

Submitted 19 June, 2018; originally announced June 2018.

Comments: COLING 2018. Code available at: https://github.com/henrymoss/COLING2018
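
The J-K-fold procedure described in this abstract, J independent K-fold cross validations averaged together, can be sketched in a few lines. This is a minimal illustration and not the authors' released code (see the repository linked above for that). The helper names `k_fold_indices` and `jk_fold_cv`, the `fit` and `score` callables, and the toy majority-class model are all hypothetical.

```python
import random
from statistics import mean, pstdev


def k_fold_indices(n, k, rng):
    """Randomly partition range(n) into k folds of near-equal size."""
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def jk_fold_cv(xs, ys, fit, score, j=5, k=3, seed=0):
    """Average a score over J independent random K-fold partitions.

    fit(train_x, train_y) returns a model; score(model, test_x, test_y)
    returns a float. Returns (mean, spread) of the J per-repeat averages.
    """
    rng = random.Random(seed)
    repeat_means = []
    for _ in range(j):
        fold_scores = []
        for fold in k_fold_indices(len(xs), k, rng):
            held_out = set(fold)
            train = [i for i in range(len(xs)) if i not in held_out]
            model = fit([xs[i] for i in train], [ys[i] for i in train])
            fold_scores.append(
                score(model, [xs[i] for i in fold], [ys[i] for i in fold])
            )
        repeat_means.append(mean(fold_scores))
    return mean(repeat_means), pstdev(repeat_means)


# Toy usage with a majority-class "model" and accuracy as the score.
def fit_majority(train_x, train_y):
    return max(set(train_y), key=train_y.count)


def accuracy(label, test_x, test_y):
    return mean(1.0 if y == label else 0.0 for y in test_y)


xs = list(range(30))
ys = [1] * 20 + [0] * 10
estimate, spread = jk_fold_cv(xs, ys, fit_majority, accuracy, j=4, k=3)
```

The returned spread is the standard deviation of the J per-repeat averages, which gives a rough view of how much a single K-fold estimate would vary with the partition.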

arXiv:1705.09605 [pdf, ps, other]

Combinatorial Multi-Armed Bandits with Filtered Feedback

Authors: James A. Grant, David S. Leslie, Kevin Glazebrook, Roberto Szechtman

Abstract: Motivated by problems in search and detection we present a solution to a Combinatorial Multi-Armed Bandit (CMAB) problem with both heavy-tailed reward distributions and a new class of feedback, filtered semibandit feedback. In a CMAB problem an agent pulls a combination of arms from a set $\{1,...,k\}$ in each round, generating random outcomes from probability distributions associated with these arms and receiving an overall reward. Under semibandit feedback it is assumed that the random outcomes generated are all observed. Filtered semibandit feedback allows the outcomes that are observed to be sampled from a second distribution conditioned on the initial random outcomes. This feedback mechanism is valuable as it allows CMAB methods to be applied to sequential search and detection problems where combinatorial actions are made, but the true rewards (number of objects of interest appearing in the round) are not observed, rather a filtered reward (the number of objects the searcher successfully finds, which must by definition be less than the number that appear). We present an upper confidence bound type algorithm, Robust-F-CUCB, and associated regret bound of order $\mathcal{O}(\ln(n))$ to balance exploration and exploitation in the face of both filtering of reward and heavy tailed reward distributions.

Submitted 26 May, 2017; originally announced May 2017.

Comments: 16 pages

arXiv:1412.0543 [pdf, ps, other]

Game-theoretical control with continuous action sets

Authors: Steven Perkins, Panayotis Mertikopoulos, David S. Leslie

Abstract: Motivated by the recent applications of game-theoretical learning techniques to the design of distributed control systems, we study a class of control problems that can be formulated as potential games with continuous action sets, and we propose an actor-critic reinforcement learning algorithm that provably converges to equilibrium in this class of problems. The method employed is to analyse the learning process under study through a mean-field dynamical system that evolves in an infinite-dimensional function space (the space of probability distributions over the players' continuous controls). To do so, we extend the theory of finite-dimensional two-timescale stochastic approximation to an infinite-dimensional, Banach space setting, and we prove that the continuous dynamics of the process converge to equilibrium in the case of potential games. These results combine to give a provably-convergent learning algorithm in which players do not need to keep track of the controls selected by the other agents.

Submitted 1 December, 2014; originally announced December 2014.

Comments: 19 pages

arXiv:1112.2315 [pdf, other]

Adaptive Forgetting Factor Fictitious Play

Authors: Michalis Smyrnakis, David S. Leslie

Abstract: It is now well known that decentralised optimisation can be formulated as a potential game, and game-theoretical learning algorithms can be used to find an optimum. One of the most common learning techniques in game theory is fictitious play. However, fictitious play is founded on an implicit assumption that opponents' strategies are stationary. We present a novel variation of fictitious play that allows the use of a more realistic model of opponent strategy. It uses a heuristic approach, from the online streaming data literature, to adaptively update the weights assigned to recently observed actions. We compare the results of the proposed algorithm with those of stochastic and geometric fictitious play in a simple strategic form game, a vehicle target assignment game and a disaster management problem. In all the tests the rate of convergence of the proposed algorithm was similar to or better than that of the variations of fictitious play we compared it with. The new algorithm therefore improves the performance of game-theoretical learning in decentralised optimisation.

Submitted 10 December, 2011; originally announced December 2011.

Showing 1–19 of 19 results for author: Leslie, D S