Search | arXiv e-print repository

Defining Expertise: Applications to Treatment Effect Estimation

Authors: Alihan Hüyük, Qiyao Wei, Alicia Curth, Mihaela van der Schaar

Abstract: Decision-makers are often experts of their domain and take actions based on their domain knowledge. Doctors, for instance, may prescribe treatments by predicting the likely outcome of each available treatment. Actions of an expert thus naturally encode part of their domain knowledge, and can help make inferences within the same domain: Knowing doctors try to prescribe the best treatment for their… ▽ More Decision-makers are often experts of their domain and take actions based on their domain knowledge. Doctors, for instance, may prescribe treatments by predicting the likely outcome of each available treatment. Actions of an expert thus naturally encode part of their domain knowledge, and can help make inferences within the same domain: Knowing doctors try to prescribe the best treatment for their patients, we can tell treatments prescribed more frequently are likely to be more effective. Yet in machine learning, the fact that most decision-makers are experts is often overlooked, and "expertise" is seldom leveraged as an inductive bias. This is especially true for the literature on treatment effect estimation, where often the only assumption made about actions is that of overlap. In this paper, we argue that expertise - particularly the type of expertise the decision-makers of a domain are likely to have - can be informative in designing and selecting methods for treatment effect estimation. We formally define two types of expertise, predictive and prognostic, and demonstrate empirically that: (i) the prominent type of expertise in a domain significantly influences the performance of different methods in treatment effect estimation, and (ii) it is possible to predict the type of expertise present in a dataset, which can provide a quantitative basis for model selection. △ Less

Submitted 1 March, 2024; originally announced March 2024.

Comments: The 12th International Conference on Learning Representations (ICLR 2024)

arXiv:2401.17205 [pdf, other]

Adaptive Experiment Design with Synthetic Controls

Authors: Alihan Hüyük, Zhaozhi Qian, Mihaela van der Schaar

Abstract: Clinical trials are typically run in order to understand the effects of a new treatment on a given population of patients. However, patients in large populations rarely respond the same way to the same treatment. This heterogeneity in patient responses necessitates trials that investigate effects on multiple subpopulations - especially when a treatment has marginal or no benefit for the overall po… ▽ More Clinical trials are typically run in order to understand the effects of a new treatment on a given population of patients. However, patients in large populations rarely respond the same way to the same treatment. This heterogeneity in patient responses necessitates trials that investigate effects on multiple subpopulations - especially when a treatment has marginal or no benefit for the overall population but might have significant benefit for a particular subpopulation. Motivated by this need, we propose Syntax, an exploratory trial design that identifies subpopulations with positive treatment effect among many subpopulations. Syntax is sample efficient as it (i) recruits and allocates patients adaptively and (ii) estimates treatment effects by forming synthetic controls for each subpopulation that combines control samples from other subpopulations. We validate the performance of Syntax and provide insights into when it might have an advantage over conventional trial designs through experiments. △ Less

Submitted 9 February, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

Comments: Proceedings of the 27th International Conference on Artificial Intelligence and Statistics

arXiv:2311.14110 [pdf, other]

When is Off-Policy Evaluation Useful? A Data-Centric Perspective

Authors: Hao Sun, Alex J. Chan, Nabeel Seedat, Alihan Hüyük, Mihaela van der Schaar

Abstract: Evaluating the value of a hypothetical target policy with only a logged dataset is important but challenging. On the one hand, it brings opportunities for safe policy improvement under high-stakes scenarios like clinical guidelines. On the other hand, such opportunities raise a need for precise off-policy evaluation (OPE). While previous work on OPE focused on improving the algorithm in value esti… ▽ More Evaluating the value of a hypothetical target policy with only a logged dataset is important but challenging. On the one hand, it brings opportunities for safe policy improvement under high-stakes scenarios like clinical guidelines. On the other hand, such opportunities raise a need for precise off-policy evaluation (OPE). While previous work on OPE focused on improving the algorithm in value estimation, in this work, we emphasize the importance of the offline dataset, hence putting forward a data-centric framework for evaluating OPE problems. We propose DataCOPE, a data-centric framework for evaluating OPE, that answers the questions of whether and to what extent we can evaluate a target policy given a dataset. DataCOPE (1) forecasts the overall performance of OPE algorithms without access to the environment, which is especially useful before real-world deployment where evaluating OPE is impossible; (2) identifies the sub-group in the dataset where OPE can be inaccurate; (3) permits evaluations of datasets or data-collection strategies for OPE problems. Our empirical analysis of DataCOPE in the logged contextual bandit settings using healthcare datasets confirms its ability to evaluate both machine-learning and human expert policies like clinical guidelines. △ Less

Submitted 23 November, 2023; originally announced November 2023.

Comments: Off-Policy Evaluation, Data-Centric AI, Data-Centric Reinforcement Learning, Reinforcement Learning

arXiv:2311.07426 [pdf, other]

Optimising Human-AI Collaboration by Learning Convincing Explanations

Authors: Alex J. Chan, Alihan Huyuk, Mihaela van der Schaar

Abstract: Machine learning models are being increasingly deployed to take, or assist in taking, complicated and high-impact decisions, from quasi-autonomous vehicles to clinical decision support systems. This poses challenges, particularly when models have hard-to-detect failure modes and are able to take actions without oversight. In order to handle this challenge, we propose a method for a collaborative s… ▽ More Machine learning models are being increasingly deployed to take, or assist in taking, complicated and high-impact decisions, from quasi-autonomous vehicles to clinical decision support systems. This poses challenges, particularly when models have hard-to-detect failure modes and are able to take actions without oversight. In order to handle this challenge, we propose a method for a collaborative system that remains safe by having a human ultimately making decisions, while giving the model the best opportunity to convince and debate them with interpretable explanations. However, the most helpful explanation varies among individuals and may be inconsistent across stated preferences. To this end we develop an algorithm, Ardent, to efficiently learn a ranking through interaction and best assist humans complete a task. By utilising a collaborative approach, we can ensure safety and improve performance while addressing transparency and accountability concerns. Ardent enables efficient and effective decision-making by adapting to individual preferences for explanations, which we validate through extensive simulations alongside a user study involving a challenging image classification task, demonstrating consistent improvement over competing systems. △ Less

Submitted 13 November, 2023; originally announced November 2023.

arXiv:2310.19831 [pdf, other]

Explaining by Imitating: Understanding Decisions by Interpretable Policy Learning

Authors: Alihan Hüyük, Daniel Jarrett, Mihaela van der Schaar

Abstract: Understanding human behavior from observed data is critical for transparency and accountability in decision-making. Consider real-world settings such as healthcare, in which modeling a decision-maker's policy is challenging -- with no access to underlying states, no knowledge of environment dynamics, and no allowance for live experimentation. We desire learning a data-driven representation of deci… ▽ More Understanding human behavior from observed data is critical for transparency and accountability in decision-making. Consider real-world settings such as healthcare, in which modeling a decision-maker's policy is challenging -- with no access to underlying states, no knowledge of environment dynamics, and no allowance for live experimentation. We desire learning a data-driven representation of decision-making behavior that (1) inheres transparency by design, (2) accommodates partial observability, and (3) operates completely offline. To satisfy these key criteria, we propose a novel model-based Bayesian method for interpretable policy learning ("Interpole") that jointly estimates an agent's (possibly biased) belief-update process together with their (possibly suboptimal) belief-action map**. Through experiments on both simulated and real-world data for the problem of Alzheimer's disease diagnosis, we illustrate the potential of our approach as an investigative device for auditing, quantifying, and understanding human decision-making behavior. △ Less

Submitted 28 October, 2023; originally announced October 2023.

Journal ref: In Proc. 9th International Conference on Learning Representations (ICLR 2021)

arXiv:2310.18601 [pdf, other]

Online Decision Mediation

Authors: Daniel Jarrett, Alihan Hüyük, Mihaela van der Schaar

Abstract: Consider learning a decision support assistant to serve as an intermediary between (oracle) expert behavior and (imperfect) human behavior: At each time, the algorithm observes an action chosen by a fallible agent, and decides whether to *accept* that agent's decision, *intervene* with an alternative, or *request* the expert's opinion. For instance, in clinical diagnosis, fully-autonomous machine… ▽ More Consider learning a decision support assistant to serve as an intermediary between (oracle) expert behavior and (imperfect) human behavior: At each time, the algorithm observes an action chosen by a fallible agent, and decides whether to *accept* that agent's decision, *intervene* with an alternative, or *request* the expert's opinion. For instance, in clinical diagnosis, fully-autonomous machine behavior is often beyond ethical affordances, thus real-world decision support is often limited to monitoring and forecasting. Instead, such an intermediary would strike a prudent balance between the former (purely prescriptive) and latter (purely descriptive) approaches, while providing an efficient interface between human mistakes and expert feedback. In this work, we first formalize the sequential problem of *online decision mediation* -- that is, of simultaneously learning and evaluating mediator policies from scratch with *abstentive feedback*: In each round, deferring to the oracle obviates the risk of error, but incurs an upfront penalty, and reveals the otherwise hidden expert action as a new training data point. Second, we motivate and propose a solution that seeks to trade off (immediate) loss terms against (future) improvements in generalization error; in doing so, we identify why conventional bandit algorithms may fail. Finally, through experiments and sensitivities on a variety of datasets, we illustrate consistent gains over applicable benchmarks on performance measures with respect to the mediator policy, the learned model, and the decision-making system as a whole. △ Less

Submitted 28 October, 2023; originally announced October 2023.

Journal ref: In Proc. 36th International Conference on Neural Information Processing Systems (NeurIPS 2022)

arXiv:2310.18591 [pdf, other]

Inverse Decision Modeling: Learning Interpretable Representations of Behavior

Authors: Daniel Jarrett, Alihan Hüyük, Mihaela van der Schaar

Abstract: Decision analysis deals with modeling and enhancing decision processes. A principal challenge in improving behavior is in obtaining a transparent description of existing behavior in the first place. In this paper, we develop an expressive, unifying perspective on inverse decision modeling: a framework for learning parameterized representations of sequential decision behavior. First, we formalize t… ▽ More Decision analysis deals with modeling and enhancing decision processes. A principal challenge in improving behavior is in obtaining a transparent description of existing behavior in the first place. In this paper, we develop an expressive, unifying perspective on inverse decision modeling: a framework for learning parameterized representations of sequential decision behavior. First, we formalize the forward problem (as a normative standard), subsuming common classes of control behavior. Second, we use this to formalize the inverse problem (as a descriptive model), generalizing existing work on imitation/reward learning -- while opening up a much broader class of research problems in behavior representation. Finally, we instantiate this approach with an example (inverse bounded rational control), illustrating how this structure enables learning (interpretable) representations of (bounded) rationality -- while naturally capturing intuitive notions of suboptimal actions, biased beliefs, and imperfect knowledge of environments. △ Less

Submitted 28 October, 2023; originally announced October 2023.

Journal ref: In Proc. 38th International Conference on Machine Learning (ICML 2021)

arXiv:2310.07747 [pdf, other]

Accountability in Offline Reinforcement Learning: Explaining Decisions with a Corpus of Examples

Authors: Hao Sun, Alihan Hüyük, Daniel Jarrett, Mihaela van der Schaar

Abstract: Learning controllers with offline data in decision-making systems is an essential area of research due to its potential to reduce the risk of applications in real-world systems. However, in responsibility-sensitive settings such as healthcare, decision accountability is of paramount importance, yet has not been adequately addressed by the literature. This paper introduces the Accountable Offline C… ▽ More Learning controllers with offline data in decision-making systems is an essential area of research due to its potential to reduce the risk of applications in real-world systems. However, in responsibility-sensitive settings such as healthcare, decision accountability is of paramount importance, yet has not been adequately addressed by the literature. This paper introduces the Accountable Offline Controller (AOC) that employs the offline dataset as the Decision Corpus and performs accountable control based on a tailored selection of examples, referred to as the Corpus Subset. AOC operates effectively in low-data scenarios, can be extended to the strictly offline imitation setting, and displays qualities of both conservation and adaptability. We assess AOC's performance in both simulated and real-world healthcare scenarios, emphasizing its capability to manage offline control tasks with high levels of performance while maintaining accountability. △ Less

Submitted 27 October, 2023; v1 submitted 11 October, 2023; originally announced October 2023.

arXiv:2309.06553 [pdf, other]

Query-Dependent Prompt Evaluation and Optimization with Offline Inverse RL

Authors: Hao Sun, Alihan Hüyük, Mihaela van der Schaar

Abstract: In this study, we aim to enhance the arithmetic reasoning ability of Large Language Models (LLMs) through zero-shot prompt optimization. We identify a previously overlooked objective of query dependency in such optimization and elucidate two ensuing challenges that impede the successful and economical design of prompt optimization techniques. One primary issue is the absence of an effective method… ▽ More In this study, we aim to enhance the arithmetic reasoning ability of Large Language Models (LLMs) through zero-shot prompt optimization. We identify a previously overlooked objective of query dependency in such optimization and elucidate two ensuing challenges that impede the successful and economical design of prompt optimization techniques. One primary issue is the absence of an effective method to evaluate prompts during inference when the golden answer is unavailable. Concurrently, learning via interactions with the LLMs to navigate the expansive natural language prompting space proves to be resource-intensive. To address this, we introduce Prompt-OIRL, which harnesses offline inverse reinforcement learning to draw insights from offline prompting demonstration data. Such data exists as by-products when diverse prompts are benchmarked on open-accessible datasets. With Prompt-OIRL, the query-dependent prompt optimization objective is achieved by first learning an offline reward model. This model can evaluate any query-prompt pairs without accessing LLMs. Subsequently, a best-of-N strategy is deployed to recommend the optimal prompt. Our experimental evaluations across various LLM scales and arithmetic reasoning datasets underscore both the efficacy and economic viability of the proposed approach. △ Less

Submitted 7 March, 2024; v1 submitted 12 September, 2023; originally announced September 2023.

arXiv:2302.12604 [pdf, other]

Neural Laplace Control for Continuous-time Delayed Systems

Authors: Samuel Holt, Alihan Hüyük, Zhaozhi Qian, Hao Sun, Mihaela van der Schaar

Abstract: Many real-world offline reinforcement learning (RL) problems involve continuous-time environments with delays. Such environments are characterized by two distinctive features: firstly, the state x(t) is observed at irregular time intervals, and secondly, the current action a(t) only affects the future state x(t + g) with an unknown delay g > 0. A prime example of such an environment is satellite c… ▽ More Many real-world offline reinforcement learning (RL) problems involve continuous-time environments with delays. Such environments are characterized by two distinctive features: firstly, the state x(t) is observed at irregular time intervals, and secondly, the current action a(t) only affects the future state x(t + g) with an unknown delay g > 0. A prime example of such an environment is satellite control where the communication link between earth and a satellite causes irregular observations and delays. Existing offline RL algorithms have achieved success in environments with irregularly observed states in time or known delays. However, environments involving both irregular observations in time and unknown delays remains an open and challenging problem. To this end, we propose Neural Laplace Control, a continuous-time model-based offline RL method that combines a Neural Laplace dynamics model with a model predictive control (MPC) planner--and is able to learn from an offline dataset sampled with irregular time intervals from an environment that has a inherent unknown constant delay. We show experimentally on continuous-time delayed environments it is able to achieve near expert policy performance. △ Less

Submitted 10 April, 2023; v1 submitted 24 February, 2023; originally announced February 2023.

ACM Class: I.2.6; I.2.5; E.1

arXiv:2208.05844 [pdf, other]

Adaptive Identification of Populations with Treatment Benefit in Clinical Trials: Machine Learning Challenges and Solutions

Authors: Alicia Curth, Alihan Hüyük, Mihaela van der Schaar

Abstract: We study the problem of adaptively identifying patient subpopulations that benefit from a given treatment during a confirmatory clinical trial. This type of adaptive clinical trial has been thoroughly studied in biostatistics, but has been allowed only limited adaptivity so far. Here, we aim to relax classical restrictions on such designs and investigate how to incorporate ideas from the recent ma… ▽ More We study the problem of adaptively identifying patient subpopulations that benefit from a given treatment during a confirmatory clinical trial. This type of adaptive clinical trial has been thoroughly studied in biostatistics, but has been allowed only limited adaptivity so far. Here, we aim to relax classical restrictions on such designs and investigate how to incorporate ideas from the recent machine learning literature on adaptive and online experimentation to make trials more flexible and efficient. We find that the unique characteristics of the subpopulation selection problem -- most importantly that (i) one is usually interested in finding subpopulations with any treatment benefit (and not necessarily the single subgroup with largest effect) given a limited budget and that (ii) effectiveness only has to be demonstrated across the subpopulation on average -- give rise to interesting challenges and new desiderata when designing algorithmic solutions. Building on these findings, we propose AdaGGI and AdaGCPI, two meta-algorithms for subpopulation construction. We empirically investigate their performance across a range of simulation scenarios and derive insights into their (dis)advantages across different settings. △ Less

Submitted 5 June, 2023; v1 submitted 11 August, 2022; originally announced August 2022.

Comments: To appear in the Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023

arXiv:2202.10153 [pdf, other]

Inferring Lexicographically-Ordered Rewards from Preferences

Authors: Alihan Hüyük, William R. Zame, Mihaela van der Schaar

Abstract: Modeling the preferences of agents over a set of alternatives is a principal concern in many areas. The dominant approach has been to find a single reward/utility function with the property that alternatives yielding higher rewards are preferred over alternatives yielding lower rewards. However, in many settings, preferences are based on multiple, often competing, objectives; a single reward funct… ▽ More Modeling the preferences of agents over a set of alternatives is a principal concern in many areas. The dominant approach has been to find a single reward/utility function with the property that alternatives yielding higher rewards are preferred over alternatives yielding lower rewards. However, in many settings, preferences are based on multiple, often competing, objectives; a single reward function is not adequate to represent such preferences. This paper proposes a method for inferring multi-objective reward-based representations of an agent's observed preferences. We model the agent's priorities over different objectives as entering lexicographically, so that objectives with lower priorities matter only when the agent is indifferent with respect to objectives with higher priorities. We offer two example applications in healthcare, one inspired by cancer treatment, the other inspired by organ transplantation, to illustrate how the lexicographically-ordered rewards we learn can provide a better understanding of a decision-maker's preferences and help improve policies when used in reinforcement learning. △ Less

Submitted 7 June, 2022; v1 submitted 21 February, 2022; originally announced February 2022.

Comments: In Proceedings of the 36th AAAI Conference on Artificial Intelligence

arXiv:2107.06317 [pdf, other]

Inverse Contextual Bandits: Learning How Behavior Evolves over Time

Authors: Alihan Hüyük, Daniel Jarrett, Mihaela van der Schaar

Abstract: Understanding a decision-maker's priorities by observing their behavior is critical for transparency and accountability in decision processes, such as in healthcare. Though conventional approaches to policy learning almost invariably assume stationarity in behavior, this is hardly true in practice: Medical practice is constantly evolving as clinical professionals fine-tune their knowledge over tim… ▽ More Understanding a decision-maker's priorities by observing their behavior is critical for transparency and accountability in decision processes, such as in healthcare. Though conventional approaches to policy learning almost invariably assume stationarity in behavior, this is hardly true in practice: Medical practice is constantly evolving as clinical professionals fine-tune their knowledge over time. For instance, as the medical community's understanding of organ transplantations has progressed over the years, a pertinent question is: How have actual organ allocation policies been evolving? To give an answer, we desire a policy learning method that provides interpretable representations of decision-making, in particular capturing an agent's non-stationary knowledge of the world, as well as operating in an offline manner. First, we model the evolving behavior of decision-makers in terms of contextual bandits, and formalize the problem of Inverse Contextual Bandits (ICB). Second, we propose two concrete algorithms as solutions, learning parametric and nonparametric representations of an agent's behavior. Finally, using both real and simulated data for liver transplantations, we illustrate the applicability and explainability of our method, as well as benchmarking and validating its accuracy. △ Less

Submitted 8 June, 2022; v1 submitted 13 July, 2021; originally announced July 2021.

Comments: In Proceedings of the 39th International Conference on Machine Learning

arXiv:2106.04240 [pdf, other]

The Medkit-Learn(ing) Environment: Medical Decision Modelling through Simulation

Authors: Alex J. Chan, Ioana Bica, Alihan Huyuk, Daniel Jarrett, Mihaela van der Schaar

Abstract: Understanding decision-making in clinical environments is of paramount importance if we are to bring the strengths of machine learning to ultimately improve patient outcomes. Several factors including the availability of public data, the intrinsically offline nature of the problem, and the complexity of human decision making, has meant that the mainstream development of algorithms is often geared… ▽ More Understanding decision-making in clinical environments is of paramount importance if we are to bring the strengths of machine learning to ultimately improve patient outcomes. Several factors including the availability of public data, the intrinsically offline nature of the problem, and the complexity of human decision making, has meant that the mainstream development of algorithms is often geared towards optimal performance in tasks that do not necessarily translate well into the medical regime; often overlooking more niche issues commonly associated with the area. We therefore present a new benchmarking suite designed specifically for medical sequential decision making: the Medkit-Learn(ing) Environment, a publicly available Python package providing simple and easy access to high-fidelity synthetic medical data. While providing a standardised way to compare algorithms in a realistic medical setting we employ a generating process that disentangles the policy and environment dynamics to allow for a range of customisations, thus enabling systematic evaluation of algorithms' robustness against specific challenges prevalent in healthcare. △ Less

Submitted 14 March, 2022; v1 submitted 8 June, 2021; originally announced June 2021.

arXiv:2007.13531 [pdf, other]

Learning "What-if" Explanations for Sequential Decision-Making

Authors: Ioana Bica, Daniel Jarrett, Alihan Hüyük, Mihaela van der Schaar

Abstract: Building interpretable parameterizations of real-world decision-making on the basis of demonstrated behavior -- i.e. trajectories of observations and actions made by an expert maximizing some unknown reward function -- is essential for introspecting and auditing policies in different institutions. In this paper, we propose learning explanations of expert decisions by modeling their reward function… ▽ More Building interpretable parameterizations of real-world decision-making on the basis of demonstrated behavior -- i.e. trajectories of observations and actions made by an expert maximizing some unknown reward function -- is essential for introspecting and auditing policies in different institutions. In this paper, we propose learning explanations of expert decisions by modeling their reward function in terms of preferences with respect to "what if" outcomes: Given the current history of observations, what would happen if we took a particular action? To learn these cost-benefit tradeoffs associated with the expert's actions, we integrate counterfactual reasoning into batch inverse reinforcement learning. This offers a principled way of defining reward functions and explaining expert behavior, and also satisfies the constraints of real-world decision-making -- where active experimentation is often impossible (e.g. in healthcare). Additionally, by estimating the effects of different actions, counterfactuals readily tackle the off-policy nature of policy evaluation in the batch setting, and can naturally accommodate settings where the expert policies depend on histories of observations rather than just current states. Through illustrative experiments in both real and simulated medical environments, we highlight the effectiveness of our batch, counterfactual inverse reinforcement learning approach in recovering accurate and interpretable descriptions of behavior. △ Less

Submitted 30 March, 2021; v1 submitted 2 July, 2020; originally announced July 2020.

Comments: In Proc. 9th International Conference on Learning Representations (ICLR 2021)

arXiv:1907.11605 [pdf, other]

Lexicographic Multiarmed Bandit

Authors: Alihan Hüyük, Cem Tekin

Abstract: We consider a multiobjective multiarmed bandit problem with lexicographically ordered objectives. In this problem, the goal of the learner is to select arms that are lexicographic optimal as much as possible without knowing the arm reward distributions beforehand. We capture this goal by defining a multidimensional form of regret that measures the loss of the learner due to not selecting lexicogra… ▽ More We consider a multiobjective multiarmed bandit problem with lexicographically ordered objectives. In this problem, the goal of the learner is to select arms that are lexicographic optimal as much as possible without knowing the arm reward distributions beforehand. We capture this goal by defining a multidimensional form of regret that measures the loss of the learner due to not selecting lexicographic optimal arms, and then, consider two settings where the learner has prior information on the expected arm rewards. In the first setting, the learner only knows for each objective the lexicographic optimal expected reward. In the second setting, it only knows for each objective near-lexicographic optimal expected rewards. For both settings we prove that the learner achieves expected regret uniformly bounded in time. The algorithm we propose for the second setting also attains bounded regret for the multiarmed bandit with satisficing objectives. In addition, we also consider the harder prior-free case, and show that the learner can still achieve sublinear in time gap-free regret. Finally, we experimentally evaluate performance of the proposed algorithms in a variety of multiobjective learning problems. △ Less

Submitted 29 July, 2019; v1 submitted 26 July, 2019; originally announced July 2019.

Comments: 22 pages

arXiv:1907.04201 [pdf, other]

doi 10.1109/TNET.2020.3025904

Thompson Sampling for Combinatorial Network Optimization in Unknown Environments

Authors: Alihan Hüyük, Cem Tekin

Abstract: Influence maximization, adaptive routing, and dynamic spectrum allocation all require choosing the right action from a large set of alternatives. Thanks to the advances in combinatorial optimization, these and many similar problems can be efficiently solved given an environment with known stochasticity. In this paper, we take this one step further and focus on combinatorial optimization in unknown… ▽ More Influence maximization, adaptive routing, and dynamic spectrum allocation all require choosing the right action from a large set of alternatives. Thanks to the advances in combinatorial optimization, these and many similar problems can be efficiently solved given an environment with known stochasticity. In this paper, we take this one step further and focus on combinatorial optimization in unknown environments. We consider a very general learning framework called combinatorial multi-armed bandit with probabilistically triggered arms and a very powerful Bayesian algorithm called Combinatorial Thompson Sampling (CTS). Under the semi-bandit feedback model and assuming access to an oracle without knowing the expected base arm outcomes beforehand, we show that when the expected reward is Lipschitz continuous in the expected base arm outcomes CTS achieves $O(\sum_{i =1}^m\log T/(p_iΔ_i))$ regret and $O(\max\{\mathbb{E}[m\sqrt{T\log T/p^*}],\mathbb{E}[m^2/p^*]\})$ Bayesian regret, where $m$ denotes the number of base arms, $p_i$ and $Δ_i$ denote the minimum non-zero triggering probability and the minimum suboptimality gap of base arm $i$ respectively, $T$ denotes the time horizon, and $p^*$ denotes the overall minimum non-zero triggering probability. We also show that when the expected reward satisfies the triggering probability modulated Lipschitz continuity, CTS achieves $O(\max\{m\sqrt{T\log T},m^2\})$ Bayesian regret, and when triggering probabilities are non-zero for all base arms, CTS achieves $O(1/p^*\log(1/p^*))$ regret independent of the time horizon. Finally, we numerically compare CTS with algorithms based on upper confidence bounds in several networking problems and show that CTS outperforms these algorithms by at least an order of magnitude in majority of the cases. △ Less

Submitted 21 September, 2020; v1 submitted 7 July, 2019; originally announced July 2019.

Comments: 14 pages, 3 figures. Accepted for publication in the IEEE/ACM Transactions on Networking. arXiv admin note: text overlap with arXiv:1809.02707

Journal ref: IEEE/ACM Transactions on Networking, vol. 28, no. 6, pp. 2836-2849, Dec. 2020

arXiv:1809.02707 [pdf, other]

Analysis of Thompson Sampling for Combinatorial Multi-armed Bandit with Probabilistically Triggered Arms

Authors: Alihan Hüyük, Cem Tekin

Abstract: We analyze the regret of combinatorial Thompson sampling (CTS) for the combinatorial multi-armed bandit with probabilistically triggered arms under the semi-bandit feedback setting. We assume that the learner has access to an exact optimization oracle but does not know the expected base arm outcomes beforehand. When the expected reward function is Lipschitz continuous in the expected base arm outc… ▽ More We analyze the regret of combinatorial Thompson sampling (CTS) for the combinatorial multi-armed bandit with probabilistically triggered arms under the semi-bandit feedback setting. We assume that the learner has access to an exact optimization oracle but does not know the expected base arm outcomes beforehand. When the expected reward function is Lipschitz continuous in the expected base arm outcomes, we derive $O(\sum_{i =1}^m \log T / (p_i Δ_i))$ regret bound for CTS, where $m$ denotes the number of base arms, $p_i$ denotes the minimum non-zero triggering probability of base arm $i$ and $Δ_i$ denotes the minimum suboptimality gap of base arm $i$. We also compare CTS with combinatorial upper confidence bound (CUCB) via numerical experiments on a cascading bandit problem. △ Less

Submitted 19 February, 2019; v1 submitted 7 September, 2018; originally announced September 2018.

Comments: To appear in the Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019

Showing 1–18 of 18 results for author: Hüyük, A