Causal Contextual Bandits with Adaptive Context

Rahul Madhavan (IISc Bangalore), Aurghya Maiti (Columbia University), Gaurav Sinha (Microsoft Research), Siddharth Barman (IISc Bangalore)
Abstract

We study a variant of causal contextual bandits where the context is chosen based on an initial intervention chosen by the learner. At the beginning of each round, the learner selects an initial action, depending on which a stochastic context is revealed by the environment. Following this, the learner selects a final action and receives a reward. Given $T$ rounds of interactions with the environment, the objective of the learner is to learn a policy (of selecting the initial and the final action) with maximum expected reward. In this paper we study the specific situation where every action corresponds to intervening on a node in some known causal graph. We extend prior work from the deterministic context setting to obtain simple regret minimization guarantees. This is achieved through an instance-dependent causal parameter, $\lambda$, which characterizes our upper bound. Furthermore, we prove that our simple regret is essentially tight for a large class of instances. A key feature of our work is that we use convex optimization to address the bandit exploration problem. We also conduct experiments to validate our theoretical results, and release our code at the project GitHub Repository.

1 Introduction

Recent years have seen an active interest in causal bandits from the research community (Lattimore et al., 2016; Sen et al., 2017a; b; Lee & Bareinboim, 2018; Yabe et al., 2018; Lee & Bareinboim, 2019; Lu et al., 2020; Nair et al., 2021; Lu et al., 2021; 2022; Maiti et al., 2022; Varici et al., 2022; Subramanian & Ravindran, 2022; Xiong & Chen, 2023). In this setting, one assumes an environment comprising causal variables, i.e., random variables that influence each other as per a given causal (directed and acyclic) graph. Specifically, the edges in the causal DAG represent causal relationships between variables in the environment. If one of these variables is designated as a reward variable, then the goal of a learner is to maximize their reward by intervening on certain variables (i.e., by fixing the values of certain variables). The rest of the variables, which are not intervened upon, take values as per their conditional distributions, given their parents in the causal graph. In this work, as is common in the literature, we assume that the variables take values in $\{0,1\}$. Of particular interest are causal settings wherein the learner is allowed to perform atomic interventions. Here, at most one causal variable can be set to a particular value, while other variables take values in accordance with their underlying distributions.

It is relevant to note that when a learner performs an intervention in a causal graph, they get to observe the values of multiple other variables in the causal graph. Hence, the collective dependence of the reward on the variables is observed through each intervention. That is, from such an observation, the learner may be able to make inferences about the (expected) reward under other values for the causal variables (Peters et al., 2017). In essence, with a single intervention, the learner is allowed to intervene on a variable (in the causal graph), allowed to observe all other variables, and further, is privy to the effects of such an intervention. Indeed, such an observation in a causal graph is richer than a usual sample from a stochastic process. Hence, a standard goal in causal bandits is to understand the power and limitations of interventions. This goal manifests in the form of developing algorithms that identify intervention(s) that lead to high rewards, while using as few observations/interventions as possible. We use the term intervention complexity (rather than sample complexity) for our algorithm, to emphasize that interventions are richer than samples.

Figure 1: Flowchart illustrating the decision-making process of an advertiser posting ads on a platform like Amazon, and the subsequent interaction with the platform.

In the learning literature, there are several objectives that an algorithm designer might consider. Cumulative regret, simple regret, and average regret have prominently been studied in literature (Lattimore & Szepesvári, 2020; Slivkins et al., 2019). In this work we focus on minimizing simple regret, wherein the algorithm is given a time budget, up to which it may explore, at which time it has to output a near-optimal policy.

Addressing causal bandits, the notable work of Lattimore et al. (2016) obtains an intervention-complexity bound for minimizing simple regret with a focus on atomic interventions and parallel causal graphs. Maiti et al. (2022) extend this work to obtain intervention-complexity bounds for simple regret in causal graphs with unobserved variables. The work by Lu et al. (2022) extends this setting to causal Markov decision processes (MDPs), while addressing the cumulative regret objective. Combinatorial causal bandits have been studied by Feng & Chen (2023) and Xiong & Chen (2023).

Causal contextual bandits have been studied by Subramanian & Ravindran (2022) where the contexts may be chosen by the learner (rather than be provided by the environment). Here we generalize Subramanian & Ravindran (2022) to a setting where the context is provided by the environment, adaptively, in response to an initial choice of the learner.

Motivating Example: Consider an advertiser looking to post ads on a web-page, say Amazon. They may make requests for a certain type of user demographic to Amazon. Based on this initial request, the platform may actually choose one particular user to show the ad to. At this time, certain details about the user are revealed to the advertiser. For example, the platform may reveal some of the user demographics, as well as certain details about their device. Based on these details, the advertiser may choose one particular ad to show the user. In case the user clicks the ad, the advertiser receives a reward. The goal of the learner is to find optimal choices for initial user preference, as well as ad-content such that user clicks are maximized. We illustrate this example through Figure 1 where we indicate the choices available for template and content interventions.

1.1 Our Contributions

We develop an algorithm to identify near-optimal interventions in causal bandits with adaptive context, and show that the simple regret of such an algorithm is indeed tight for several instances. We highlight the main contributions of our work below.

1. We develop and analyze an algorithm for minimizing simple regret for causal bandits with adaptive context in an intervention efficient manner. We provide an upper-bound on intervention complexity in Theorem 1.

2. Interestingly, the intervention complexity of our algorithm depends on an instance-dependent structural parameter, referred to as $\lambda$ (see equation (3)), which may be much lower than $nk$, where $n$ is the number of interventions and $k$ is the number of contexts.

3. Notably, our algorithm uses a convex program to identify optimal interventions. Unlike prior work that uses optimization to design exploration (for example see Yabe et al. (2018)), we show (in Appendix Section E) that the optimization problem we design is convex, and is thus computationally efficient. Using convex optimization to design efficient exploration is in fact a distinguishing feature of our work.

4. We provide lower bound guarantees showing that our regret guarantee is tight (up to a log factor) for a large family of instances (see Section 4 and Appendix Section F).

5. We demonstrate using experiments (see Section 5) that our algorithm performs exceedingly well compared to other baselines. We note that this is because $\lambda \ll nk$ for $n$ causal variables and $k$ contexts.

In conclusion, we provide a novel convex-optimization based algorithm for Causal MDP exploration. We analyze the algorithm to come up with an instance dependent parameter $\lambda$. Further, we prove that our algorithm is sample efficient (see Theorems 1 and 2).

1.2 Additional Related Work

| Description | Reference |
| --- | --- |
| Simple regret for bandits with parallel causal graphs | Lattimore et al. (2016) |
| Simple regret for atomic soft interventions | Sen et al. (2017a) |
| Simple regret for non-atomic interventions in causal bandits | Yabe et al. (2018) |
| Cumulative regret for general causal graphs | Lu et al. (2020) |
| Simple regret in the presence of unobserved confounders | Maiti et al. (2022) |
| Cumulative regret for unknown causal graph structure | Lu et al. (2021) |
| Cumulative regret for causal contextual bandits with latent confounders | Sen et al. (2017b) |
| Simple and cumulative regret for budgeted causal bandits | Nair et al. (2021) |
| Cumulative regret for Linear SEMs | Varici et al. (2022) |
| Cumulative regret for combinatorial causal bandits | Feng & Chen (2023) |
| Cumulative regret for Causal MDPs | Lu et al. (2022) |
| Best-intervention for combinatorial causal bandits | Xiong & Chen (2023) |
| Additive Causal Bandits with Unknown Graph | Malek et al. (2023) |
| Structural Causal Bandits with Unobserved Confounders | Wei et al. (2024) |
| Confounded Budgeted Causal Bandits | Jamshidi et al. (2024) |
| Cumulative Regret for Causal Bandits with Lipschitz SEMs | Yan et al. (2024) |
| Simple regret for causal contextual bandits | Subramanian & Ravindran (2022) |
| Simple regret for causal contextual bandits with adaptive context | Our work |
Table 1: Summary of prior work in causal bandits

Ever since the introduction of the causal bandit framework by Lattimore et al. (2016), we have seen multiple works address causal bandits in various degrees of generality and using different modelling assumptions. Sen et al. (2017a) addressed the issue of soft atomic interventions using an importance sampling based approach. Soft interventions in the linear structural equation model (SEM) setting were addressed recently by Varici et al. (2022). Yabe et al. (2018) proposed an optimization based approach for non-atomic interventions. This work was extended by Xiong & Chen (2023) to provide instance dependent regret bounds. They also provide guarantees for binary generalized linear models (BGLMs). The question of unknown causal graph structure was addressed by Lu et al. (2021), whereas Nair et al. (2021) study the case where interventions are more expensive than observations.

Maiti et al. (2022) addressed simple regret for graphs containing hidden confounding causal variables, while cumulative regret in general causal graphs was addressed by Lu et al. (2020). A notable work by Lu et al. (2022) formulates the framework for causal MDPs, and they provide cumulative regret guarantees in this setting. Causal contextual bandits were addressed by Subramanian & Ravindran (2022); Sen et al. (2017b), and we extend these works to adaptive contexts.

We summarize the main works in this thread in Table 1 and provide a more detailed set of related works in Appendix A.

2 Notations and Preliminaries

(a) Illustrative figure for causal contextual bandit with adaptive context.
(b) Illustrative figure for the causal graph at the start state and at some intermediate context $i \in [k]$.
Figure 2: The transition to a particular context (chosen context in the figure on the left) is decided by the environment, whereas the interventions at the start state and an intermediate context (chosen interventions in the figure on the right) are chosen by the learner.

We model the causal contextual bandit with adaptive context as a contextual bandit problem with a causal graph corresponding to each context. The actions at each context are given by interventions on the causal graph. Additionally, we have a causal graph at the start state, and the context is stochastically dependent on the intervention on the causal graph at the start state. For ease of notation, we will refer to the start state of the learner as context $0$. The agent starts at context $0$, chooses an intervention, then transitions to one of $k$ contexts $[k] = \{1, \dots, k\}$, chooses another intervention, and then receives a reward; see Figure 2(a).

Assumptions on the Causal Graph: Formally, let $\mathcal{C}$ be the set of contexts $\{0, 1, \dots, k\}$. Then, at each context, there is a Causal Bayesian Network (CBN) represented by a causal graph; see Figure 2(b). In particular, at each context $i \in \mathcal{C}$, the causal graph is composed of $n$ variables $\{X^i_1, \dots, X^i_n\}$. Each $X^i_j$ takes values from $\{0,1\}$, with an associated conditional probability (of being equal to 0 or 1), given the other variables in the causal graph. We make the following mild assumptions on the causal graph at each context.

  1. The distribution of any node $X_i$ conditioned on its parents in the causal graph is a Bernoulli random variable with a fixed parameter.

  2. The causal graph at each context is semi-Markovian. This is equivalent to making the following assumptions on the graph. No hidden variable in the graph has a parent. Further, every hidden variable has at most two children, both observable.

  3. We transform the causal graph for each context as follows: For every hidden variable with two children, we introduce bidirected edges between them. If no path of bidirected edges exists between an intervenable node and its child, the graph is identifiable; this is a necessary and sufficient condition for estimating the graph’s associated distribution (Tian & Pearl, 2002).

Table 2: Summary of notations for our paper
| Notation | Explanation |
| --- | --- |
| Context $0$ | Start state |
| Context $[k]$ | Intermediate contexts $\{1, \dots, k\}$ |
| $X^i_j$ | Causal variables: $X^i_j \in \{0,1\}$ for all $i \in [k]$, $j \in [n]$ |
| $do(\cdot)$ | An atomic intervention of the form $do()$, $do(X^i_j = 0)$ or $do(X^i_j = 1)$ |
| $\mathcal{A}_i$ | Set of atomic interventions at context $i$ |
| $N$ | $N := \lvert\mathcal{A}_i\rvert = 2n+1$ for all $i \in [k]$ |
| $R_i$ | Reward on transition from context $i$ |
| $m_i$ | Causal observational threshold at context $i \in \{0, \dots, k\}$ |
| $M$ | Diagonal matrix of $m_i$ values |
| $P \in \mathbb{R}^{N \times k}$ | Transition probabilities matrix: $\left[P_{(a,i)} = \mathbb{P}\{i \mid a\}\right]_{a \in \mathcal{A}_0,\, i \in [k]}$ |
| $p_+$ | Transition threshold: $p_+ = \min\{P_{(a,i)} \mid P_{(a,i)} > 0\}$ |
| $\pi: \mathcal{C} \to \mathcal{A}$ | Policy, a map from contexts to interventions, i.e. $\pi(i) \in \mathcal{A}_i$ for $i \in \{0\} \cup [k]$ |
| $\mathbb{E}[R_i \mid \pi(i)]$ | Expectation of the reward at context $i$ given intervention $\pi(i)$ |

Interventions: Furthermore, we are allowed atomic interventions, i.e., we can select at most one variable and set it to either $0$ or $1$. We will use $\mathcal{A}_i$ to denote the set of atomic interventions available at context $i \in \{0, \ldots, k\}$; in particular, $\mathcal{A}_i = \{do()\} \cup \{do(X^i_j = 0), do(X^i_j = 1)\}$ for $j \in [n]$. We note that $do()$ is an empty intervention that allows all the variables to take values from their underlying conditional distributions. Also, $do(X^i_j = 0)$ and $do(X^i_j = 1)$ set the value of variable $X^i_j$ to $0$ and $1$, respectively, while leaving all the other variables to independently draw values from their respective distributions. Note that for all $i \in [k]$, we have $\lvert\mathcal{A}_i\rvert = 2n+1$. Write $N := 2n+1$.
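
As an illustration of the action set, the following minimal Python sketch enumerates $\mathcal{A}_i$ for a context with $n$ variables. The encoding of an intervention as `None` (for $do()$) or as a pair `(j, v)` (for $do(X^i_j = v)$), and the function name `atomic_interventions`, are assumptions made for this example only and are not part of the formal setup.

```python
from typing import List, Optional, Tuple

# Hypothetical encoding: None stands for do(), and (j, v) stands for do(X_j = v).
Intervention = Optional[Tuple[int, int]]


def atomic_interventions(n: int) -> List[Intervention]:
    """Return do() followed by do(X_j = 0) and do(X_j = 1) for j = 1, ..., n."""
    actions: List[Intervention] = [None]  # the empty intervention do()
    for j in range(1, n + 1):
        actions.append((j, 0))  # do(X_j = 0)
        actions.append((j, 1))  # do(X_j = 1)
    return actions


actions = atomic_interventions(3)
print(len(actions))  # N = 2n + 1 interventions
```

The list has one entry for the empty intervention and two entries per variable, which matches the count $N = 2n+1$.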

Reward: The environment provides the learner with a $\{0,1\}$ reward upon choosing an intervention at context $i \in [k]$, which we denote as $R_i$. Note that $R_i$ is a stochastic function of variables $X^i_1, \dots, X^i_n$. In particular, for all $j \in [n]$ and each realization $X^i_j = x_j \in \{0,1\}$, the reward $R_i$ is distributed as $\mathbb{P}\{R_i = 1 \mid X^i_1 = x_1, \ldots, X^i_n = x_n\}$.

Given such conditional probabilities, we will write $\mathbb{E}[R_i \mid a]$ to denote the expected value of reward $R_i$ when intervention $a \in \mathcal{A}_i$ is performed at context $i \in [k]$. Here the expectation is over the parents of the variable $R_i$ in the causal graph, with the intervened variable set at the required value. Note that these parents (of $R_i$) may in turn have conditional distributions given their parents. The leaf nodes of the causal graph are considered to have unconditional Bernoulli distributions. For instance, $\mathbb{E}[R_i \mid do(X^i_j = 1)]$ is the expected reward when variable $X^i_j$ is set to $1$, and all the other variables independently draw values from their respective (conditional) distributions. Indeed, the goal of this work is to develop an algorithm that maximizes the expected reward at context $0$.
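
To make the quantity $\mathbb{E}[R_i \mid a]$ concrete, the sketch below computes it by exact enumeration in a deliberately simplified case. It assumes a parallel graph in which the variables are mutually independent Bernoulli variables and the reward depends on them only through a user-supplied conditional probability; the names `expected_reward`, `q`, `reward_prob`, and the toy numbers are hypothetical, and variables are indexed from 0 here. General causal graphs with edges between variables would need the full factorization and are not covered by this sketch.

```python
from itertools import product
from typing import Callable, Optional, Sequence, Tuple


def expected_reward(
    q: Sequence[float],
    reward_prob: Callable[[Tuple[int, ...]], float],
    intervention: Optional[Tuple[int, int]] = None,
) -> float:
    """E[R | a] for independent X_j ~ Bernoulli(q[j]).

    intervention is None for do(), or (j, v) for do(X_j = v) with 0-based j.
    reward_prob(x) is P{R = 1 | X = x}.
    """
    total = 0.0
    for x in product((0, 1), repeat=len(q)):
        weight = 1.0
        for j, xj in enumerate(x):
            if intervention is not None and intervention[0] == j:
                # The intervened variable is fixed, so only matching values count.
                weight *= 1.0 if xj == intervention[1] else 0.0
            else:
                weight *= q[j] if xj == 1 else 1.0 - q[j]
        total += weight * reward_prob(x)
    return total


# Toy instance: the reward depends only on the first variable.
q = [0.5, 0.2]
click_prob = lambda x: 0.9 if x[0] == 1 else 0.1

print(expected_reward(q, click_prob))          # do()
print(expected_reward(q, click_prob, (0, 1)))  # do(X_0 = 1)
```

In this toy instance, intervening on the first variable raises the expected reward from 0.5 to 0.9, whereas intervening on the second variable leaves it unchanged.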

Causal Observational Threshold: We denote by $m_i$ the causal observational threshold from Maiti et al. (2022) at context $i$. (Footnote 1: Maiti et al. (2022) extend the causal observational threshold from Lattimore et al. (2016) to the general setting of causal graphs with unobserved confounders.) This is computed as follows. Let $\widehat{q}_j^i = \min_{\text{Parents}(X_j^i),\, x \in \{0,1\}} \mathbb{P}\{X_j^i = x \mid \text{Parents}(X_j^i)\}$. Further, let $S^i_\tau = \{\widehat{q}_j^i : (\widehat{q}_j^i)^c < 1/\tau\}$ be sets parameterized by $\tau$ for every $\tau \in [2, 2n]$, where $c$ indicates the c-component size. Then $m_i = \min\{\tau \text{ such that } \lvert S^i_\tau\rvert \leq \tau\}$. The existence of such a threshold at each context is guaranteed by the assumptions we made on the CBNs. In addition, let $M \in \mathbb{N}^{k \times k}$ denote the diagonal matrix of $m_1, \ldots, m_k$.
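
The definition of $m_i$ can be read as a simple search over $\tau$. The following sketch implements that search under assumptions that go beyond the text: $\tau$ ranges over the integers $2, \dots, 2n$, the values $\widehat{q}_j^i$ are given directly as a list (one entry per variable, counted with multiplicity), and the c-component size $c$ is a single number shared by all variables (with $c = 1$ when there are no hidden confounders). The function name `observational_threshold` is hypothetical.

```python
from typing import Sequence


def observational_threshold(q_hat: Sequence[float], c: int = 1) -> int:
    """Smallest integer tau in [2, 2n] with |{j : q_hat[j]**c < 1/tau}| <= tau."""
    n = len(q_hat)
    for tau in range(2, 2 * n + 1):
        # S_tau collects the variables whose (powered) minimum probability is small.
        s_tau = [qj for qj in q_hat if qj ** c < 1.0 / tau]
        if len(s_tau) <= tau:
            return tau
    raise ValueError("no threshold found; q_hat must be non-empty")


# Balanced variables: nothing is rare, so the threshold is the smallest tau.
print(observational_threshold([0.5, 0.5, 0.5]))
# Five rarely observed values: the threshold grows with the number of rare variables.
print(observational_threshold([0.01] * 5))
```

Intuitively, a small $m_i$ means that most variable values are seen often enough under the empty intervention $do()$, so few explicit interventions are needed at that context.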

Transitions at Context 0: At context $0$, the transition to the intermediate contexts $[k]$ stochastically depends on the random variables $\{X^0_1, \dots, X^0_n\}$. Here, $\mathbb{P}\{i \mid a\}$ denotes the probability of transitioning into context $i \in [k]$ with atomic intervention $a \in \mathcal{A}_0$; recall that $\mathcal{A}_0$ includes the do-nothing intervention. We will collectively denote these transition probabilities as matrix $P := \left[P_{(a,i)} = \mathbb{P}\{i \mid a\}\right]_{a \in \mathcal{A}_0,\, i \in [k]}$. Furthermore, write the transition threshold $p_+$ to denote the minimum non-zero value in $P$. Note that matrix $P \in \mathbb{R}^{\lvert\mathcal{A}_0\rvert \times k}$ is fixed, but unknown.
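
Since $P$ is unknown, it has to be estimated from observed transitions. The sketch below shows one natural estimator, the row-normalized empirical frequencies, together with the transition threshold $p_+$ computed from a given matrix. The use of plain empirical frequencies, the function names, and the toy counts are assumptions made for illustration; the estimator analyzed in the paper may differ in its details.

```python
from typing import List, Sequence


def estimate_transitions(counts: Sequence[Sequence[int]]) -> List[List[float]]:
    """Row-normalize counts[a][i] = number of transitions to context i after action a."""
    p_hat = []
    for row in counts:
        total = sum(row)
        if total == 0:
            raise ValueError("each intervention must be played at least once")
        p_hat.append([c / total for c in row])
    return p_hat


def transition_threshold(P: Sequence[Sequence[float]]) -> float:
    """p_+ : the minimum non-zero entry of the transition matrix."""
    return min(p for row in P for p in row if p > 0)


# Toy counts for two interventions at context 0 and three intermediate contexts.
counts = [[2, 2, 0], [1, 0, 9]]
P_hat = estimate_transitions(counts)
p_plus = transition_threshold(P_hat)
print(P_hat, p_plus)
```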

Policy: A map $\pi: \{0, \dots, k\} \to \mathcal{A}$, between contexts and interventions (performed by the algorithm), will be referred to as a policy. Specifically, $\pi(i) \in \mathcal{A}_i$ is the intervention at context $i \in \{0, 1, \ldots, k\}$. Note that, for any policy $\pi$, the expected reward, which we denote as $\mu(\pi)$, is equal to $\sum_{i=1}^{k} \mathbb{E}[R_i \mid \pi(i)] \cdot \mathbb{P}\{i \mid \pi(0)\}$. Maximizing expected reward at each intermediate context $i \in [k]$, we obtain the overall optimal policy $\pi^*$ as follows. For $i \in [k]$:

$$\pi^*(i) = \operatorname*{arg\,max}_{a \in \mathcal{A}_i} \ \mathbb{E}[R_i \mid a] \tag{1}$$
$$\pi^*(0) = \operatorname*{arg\,max}_{b \in \mathcal{A}_0} \Big( \sum_{i=1}^{k} \mathbb{E}[R_i \mid \pi^*(i)] \cdot \mathbb{P}\{i \mid b\} \Big) \tag{2}$$

Our goal then is to find a policy $\pi$ with (expected) reward as close to that of $\pi^*$ as possible.
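
When the transition matrix and the expected rewards are known, the optimal policy follows from a two-stage maximization: first the best intervention at each intermediate context, then the best intervention at context $0$ given those choices. The sketch below illustrates this on a toy instance. The matrices `P` and `R`, the 0-based indexing of contexts and interventions, and the function name `optimal_policy` are assumptions for this example; in the bandit problem these quantities are unknown and must be estimated.

```python
from typing import List, Sequence, Tuple


def optimal_policy(
    P: Sequence[Sequence[float]],  # P[b][i] = P{i | b} for b in A_0
    R: Sequence[Sequence[float]],  # R[i][a] = E[R_i | a] for a in A_i
) -> Tuple[int, List[int], float]:
    """Return (pi*(0), [pi*(1), ..., pi*(k)], mu(pi*)) using 0-based indices."""
    k = len(R)
    # Stage 1: best intervention at each intermediate context.
    pi_ctx = [max(range(len(R[i])), key=lambda a: R[i][a]) for i in range(k)]
    best = [R[i][pi_ctx[i]] for i in range(k)]

    # Stage 2: best intervention at context 0, given the stage-1 choices.
    def value(b: int) -> float:
        return sum(P[b][i] * best[i] for i in range(k))

    pi0 = max(range(len(P)), key=value)
    return pi0, pi_ctx, value(pi0)


P = [[0.5, 0.5], [0.9, 0.1]]  # two interventions at context 0, two contexts
R = [[0.2, 0.8], [0.6, 0.3]]  # two interventions per intermediate context
pi0, pi_ctx, mu_star = optimal_policy(P, R)
print(pi0, pi_ctx, mu_star)
```

Here the second intervention at context $0$ is preferred because it steers the learner towards the context with the higher achievable reward.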

Simple Regret: Conforming to the standard simple-regret framework, the algorithm is given a time budget $T$, i.e., the learner can go through the following process $T$ times: (a) Start at context $0$. (b) Choose an intervention $a \in \mathcal{A}_0$. (c) Transition to context $i \in [k]$. (d) Choose an intervention $a \in \mathcal{A}_i$. (e) Receive reward $R_i$. At the end of these $T$ steps, the goal of the learner is to compute a policy. Let the policy returned by the learner be $\widehat{\pi}$. Then the simple regret is defined as the expected value $\mathbb{E}[\mu(\pi^*) - \mu(\widehat{\pi})]$. Our algorithm seeks to minimize such a simple regret.
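
The interaction protocol and the regret of a returned policy can be sketched as follows. The code simulates one round of steps (a) to (e) on a toy instance and evaluates the gap $\mu(\pi^*) - \mu(\widehat{\pi})$ for one fixed returned policy; the simple regret is the expectation of this gap over the randomness of the learner and the environment, which this sketch does not compute. The matrices `P` and `R`, the function names, and the 0-based indexing are assumptions made for illustration.

```python
import random
from typing import Callable, Sequence, Tuple


def play_round(
    P: Sequence[Sequence[float]],
    R: Sequence[Sequence[float]],
    a0: int,
    choose: Callable[[int], int],
    rng: random.Random,
) -> Tuple[int, int, int]:
    """One round: play a0 at context 0, observe context i, play choose(i), get reward."""
    i = rng.choices(range(len(P[a0])), weights=P[a0])[0]  # (c) context i ~ P{. | a0}
    a = choose(i)                                         # (d) intervention at context i
    reward = 1 if rng.random() < R[i][a] else 0           # (e) Bernoulli reward
    return i, a, reward


def policy_value(P, R, pi0: int, pi_ctx: Sequence[int]) -> float:
    """mu(pi) = sum_i E[R_i | pi(i)] * P{i | pi(0)}."""
    return sum(P[pi0][i] * R[i][pi_ctx[i]] for i in range(len(R)))


P = [[0.5, 0.5], [0.9, 0.1]]
R = [[0.2, 0.8], [0.6, 0.3]]

mu_star = policy_value(P, R, 1, [1, 0])  # optimal policy of this toy instance
mu_hat = policy_value(P, R, 0, [1, 0])   # a returned policy with a suboptimal pi(0)
gap = mu_star - mu_hat
print(gap)

print(play_round(P, R, 0, lambda i: 0, random.Random(0)))
```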

3 Main Algorithm and its Analysis

We now provide the details relating to our main Algorithm, viz. ConvExplore.

Algorithm 1 ConvExplore: Convex Exploration Algorithm
1: Input: Total rounds $T$
2: Estimate the transition probabilities $\widehat{P}$ from the start state to the intermediate contexts for time $T/3$, by performing interventions at context 0 in a round-robin manner.
3: Estimate the causal observational threshold matrix $\hat{M}$ for time $T/3$, by performing interventions at context 0 as per frequency vector $\tilde{f}$, where $\tilde{f} \leftarrow \operatorname*{arg\,max}_{\text{fq. vector } f} \ \min_{\text{contexts } [k]} \widehat{P}^{\top} f$.
4: Estimate the reward matrix $\widehat{\mathcal{R}}$ for time $T/3$, by performing interventions at context 0 as per frequency vector $\widehat{f}^{*}$, where $\widehat{f}^{*} \leftarrow \operatorname*{arg\,min}_{\text{fq. vector } f} \ \max_{\text{interventions } \mathcal{I}_0} \widehat{P} \hat{M}^{1/2} \left(\widehat{P}^{\top} f\right)^{\circ -\frac{1}{2}}$. (Footnote: Computation of $\widehat{f}^{*}$ is efficient, as we show that the problem is convex.)
5: Estimate the optimal action at each intermediate context, $\widehat{\pi}(i) \ \forall i \in [k]$, based on $\widehat{\mathcal{R}}$. Let the estimate of the optimal reward be $\widehat{\mathcal{R}}(\widehat{\pi}(i))$.
6: Estimate the optimal action at the start context, $\widehat{\pi}(0)$, based on the transition probabilities $\widehat{P}$ and the optimal reward estimates $\widehat{\mathcal{R}}(\widehat{\pi}(i))$.
7: return $\widehat{\pi} = \{\widehat{\pi}(0), \widehat{\pi}(1), \dots, \widehat{\pi}(k)\}$.
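Steps 5 and 6 of Algorithm 1 amount to a plug-in version of equation (2). The sketch below illustrates them under our own assumptions: the estimates are given as nested lists, ties are broken by the lowest index, and the function names and toy numbers are hypothetical. The last lines evaluate the gap $\mu(\pi^{*}) - \mu(\widehat{\pi})$ for one returned policy, assuming $\mu(\pi) = \sum_i \mathbb{P}\{i \mid \pi(0)\}\, \mathbb{E}[R_i \mid \pi(i)]$.

```python
def plug_in_policy(P_hat, R_hat):
    """Steps 5-6: greedy actions from estimated transitions and rewards."""
    # Step 5: best intervention at each intermediate context.
    pi_ctx = [max(range(len(row)), key=row.__getitem__) for row in R_hat]
    best = [R_hat[i][b] for i, b in enumerate(pi_ctx)]
    # Step 6: start intervention with the largest estimated expected reward, cf. equation (2).
    def value(a):
        return sum(p * r for p, r in zip(P_hat[a], best))
    pi_start = max(range(len(P_hat)), key=value)
    return pi_start, pi_ctx

def policy_value(P, R, pi_start, pi_ctx):
    """Expected reward of a policy under the given transitions P and mean rewards R."""
    return sum(P[pi_start][i] * R[i][pi_ctx[i]] for i in range(len(R)))

# Hypothetical ground truth and noisy estimates.
P_true = [[0.5, 0.5], [0.9, 0.1]]
R_true = [[0.5, 0.8], [0.5, 0.5]]
P_hat = [[0.6, 0.4], [0.7, 0.3]]
R_hat = [[0.6, 0.55], [0.5, 0.45]]

opt = plug_in_policy(P_true, R_true)    # optimal policy from exact quantities
est = plug_in_policy(P_hat, R_hat)      # policy returned from estimates
gap = policy_value(P_true, R_true, *opt) - policy_value(P_true, R_true, *est)
```

In this toy instance the estimated rewards mis-rank the interventions at the first context, so the returned policy loses reward even though it picks the right start intervention.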

The algorithm can be described by five main steps. In the first step, we estimate the transitions to intermediate contexts. In the second step, we estimate the causal observational thresholds at these contexts. In the third step, we estimate the rewards upon doing interventions at these contexts. With good reward estimates and transition probability estimates, the computation of a good policy at the intermediate contexts (step 4) and at the start state (step 5) is straightforward. This Algorithm relies on three subroutines which are detailed in Section B of the Appendix. The key aspect of this algorithm is in designing the exploration of interventions (at the start state and at the intermediate contexts) to be regret-optimal – i.e. trading off exploration time between different interventions such that the policy eventually obtained has near-optimal reward.

Our algorithm (ConvExplore) uses subroutines to estimate the transition probabilities, the causal parameters, and the rewards. From these, it outputs the best available interventions as its policy $\widehat{\pi}$. Given time budget $T$, the algorithm uses the first $T/3$ rounds to estimate the transition probabilities (i.e., the matrix $P$) in Algorithm 2. The subsequent $T/3$ rounds are utilized in Algorithm 3 to estimate the causal parameters $m_i$. Finally, the remaining budget is used in Algorithm 4 to estimate the intervention-dependent rewards $R_i$, for all intermediate contexts $i \in [k]$.

To judiciously explore the interventions at context $0$, ConvExplore computes frequency vectors $f \in \mathbb{R}^{\lvert \mathcal{A}_0 \rvert}$. In such vectors, the $a$th component $f_a \ge 0$ denotes the fraction of time that intervention $a \in \mathcal{A}_0$ is performed by the algorithm, i.e., given time budget $T'$, the intervention $a$ will be performed $f_a T'$ times. Note that, by definition, $\sum_a f_a = 1$, and the frequency vectors are computed by solving convex programs over the estimates. The algorithm and its subroutines throughout consider empirical estimates, i.e., find the estimates by direct counting. Here, let $\widehat{P}$ denote the computed estimate of the matrix $P$ and $\hat{M}$ be the estimate of the diagonal matrix $M$. We obtain a regret upper bound via an optimal frequency vector $\widehat{f}^{*}$ (see Step 4 in ConvExplore).
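Since $f_a T'$ need not be an integer, an implementation has to round the frequency vector into per-intervention counts. The text does not say how this is done; the sketch below uses largest-remainder rounding as one reasonable choice, and the name `allocate_rounds` is hypothetical.

```python
import math

def allocate_rounds(f, budget):
    """Turn a frequency vector f (non-negative, summing to 1) into integer
    per-intervention counts that sum to `budget` (largest-remainder rounding)."""
    raw = [fa * budget for fa in f]
    counts = [math.floor(x) for x in raw]
    leftover = budget - sum(counts)
    # Hand the remaining rounds to the largest fractional parts (ties: lowest index).
    order = sorted(range(len(f)), key=lambda a: (-(raw[a] - counts[a]), a))
    for a in order[:leftover]:
        counts[a] += 1
    return counts

counts = allocate_rounds([0.5, 0.25, 0.25], 10)
```

Each count differs from $f_a T'$ by less than one round, so the rounding does not affect the order of the estimation error.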

Recall that for any vector $x$ (with non-negative components), the Hadamard exponentiation $\circ{-0.5}$ leads to the vector $y = x^{\circ -0.5}$ wherein $y_i = 1/\sqrt{x_i}$ for each component $i$. We next define a key parameter $\lambda$ that specifies the regret bound in Theorem 1 (below). At a high level, the parameter $\lambda$ captures the "exploration efficacy" in the MDP, taking into account the transition probabilities $P$ and the exploration requirements $M$ at the intermediate layer. Identification of this parameter is a relevant technical contribution of our work; see Section C.1 for a detailed derivation of $\lambda$.

$$\lambda := \min_{\text{fq. vector } f} \ \left\lVert P M^{0.5} \left(P^{\top} f\right)^{\circ -0.5} \right\rVert_{\infty}^{2} \qquad (3)$$

Furthermore, we will write $f^{*}$ to denote the optimal frequency vector in equation (3). Hence, with vector $\nu := P M^{0.5} (P^{\top} f^{*})^{\circ -0.5}$, we have $\lambda = \max_a \nu_a^2$. Note that Step 4 in ConvExplore addresses an analogous optimization problem, albeit with the estimates $\widehat{P}$ and $\hat{M}$. Also, we show in Lemma 11 (see Section E in the supplementary material) that this optimization problem is convex and, hence, Step 4 admits an efficient implementation.
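To make the objective in equation (3) concrete, the following sketch evaluates $\max_a \nu_a^2$ for a given frequency vector and minimizes it in the special case of two start interventions. It is an illustration only: it uses a ternary search along $f = (t, 1-t)$, which relies on the convexity noted above, rather than the general convex solver used in the paper, and the function names and the toy instance are hypothetical.

```python
import math

def exploration_objective(P, m, f):
    """max_a nu_a^2 with nu = P M^{1/2} (P^T f)^{o-1/2}, where M = diag(m)."""
    k = len(m)
    # Visit frequency of each context under f: (P^T f)_i.
    visits = [sum(f[a] * P[a][i] for a in range(len(P))) for i in range(k)]
    nu = [sum(P[a][i] * math.sqrt(m[i] / visits[i]) for i in range(k))
          for a in range(len(P))]
    return max(v * v for v in nu)

def minimize_two_actions(P, m, iters=200):
    """Ternary search over f = (t, 1 - t); valid because the objective is convex in f."""
    lo, hi = 1e-9, 1.0 - 1e-9
    for _ in range(iters):
        t1, t2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if exploration_objective(P, m, [t1, 1 - t1]) < exploration_objective(P, m, [t2, 1 - t2]):
            hi = t2
        else:
            lo = t1
    t = (lo + hi) / 2
    return [t, 1 - t], exploration_objective(P, m, [t, 1 - t])

# Toy instance: each start intervention leads deterministically to its own context.
P = [[1.0, 0.0], [0.0, 1.0]]
m = [2.0, 8.0]
f_star, lam = minimize_two_actions(P, m)
```

In this toy instance the objective reduces to $\max\{2/f_1, 8/f_2\}$, so the optimum equalizes the two terms at $f^{*} = (0.2, 0.8)$ with $\lambda = 10$: the context with the larger $m_i$ receives proportionally more exploration.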

To understand the behaviour of $\lambda$, we first note that whenever the $m_i$ values at the contexts $i \in [k]$ are low, the $\lambda$ value is low. Specifically, the $m_i$ values can go as low as $2$ (when the $q^i_j$s are all $\frac{1}{2}$), removing the dependence of $\lambda$ on $n$. The upper bound on $\lambda$ is $nk$. We see this by first upper-bounding each $m_i$ by $n$. Then, note that whenever $\max_{a \in \mathcal{A}} P\{i \mid a\} \ge 1/k$, there exists $f$ such that $P^{\top} f = u$, where $u = \{\frac{1}{k}, \dots, \frac{1}{k}\}$. Now we can compute that $\lVert P \cdot u^{\circ -0.5} \rVert_{\infty}^{2} = k$, and thereby $\lambda \le nk$. (Footnote: We show detailed algorithms for the estimation of the transition probabilities $P$ (line 2), of the causal observational threshold $M$ (line 3), and of the rewards $\mathcal{R}$ (line 4) in Appendix B.)
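The computation behind this upper bound can be spelled out as follows. This is our own expansion of the argument, where $\tilde{\nu}$ denotes the vector obtained from the particular $f$ with $P^{\top} f = u$ (which need not be the optimal $f^{*}$).

```latex
\begin{align*}
\left(P\,u^{\circ -0.5}\right)_a
  &= \sum_{i=1}^{k} P\{i \mid a\}\,\sqrt{k} = \sqrt{k}
  && \text{(each row of $P$ sums to $1$)},\\
\tilde{\nu}_a := \left(P M^{0.5} u^{\circ -0.5}\right)_a
  &= \sum_{i=1}^{k} P\{i \mid a\}\,\sqrt{m_i\,k} \;\le\; \sqrt{nk}
  && \text{(using $m_i \le n$)},\\
\lambda
  &\le \max_a \tilde{\nu}_a^{2} \;\le\; nk
  && \text{(since $\lambda$ is a minimum over $f$)}.
\end{align*}
```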

The following theorem, which upper bounds the regret of ConvExplore, is the main result of the current work. The result requires the algorithm's time budget to be at least $T_0 := \widetilde{O}\left(N \max(m_i) / p_+^{3}\right)$.

Theorem 1.

Given number of rounds $T \ge T_0$ and $\lambda$ as in equation (3), ConvExplore achieves regret

$$\mathrm{Regret}_T \in \mathcal{O}\left(\sqrt{\max\left\{\frac{\lambda}{T}, \frac{m_0}{T p_+}\right\} \log\left(NT\right)}\right)$$

Observe that $m_0/(T p_+)$ is independent of the number of contexts and interventions. Therefore, $\lambda$ dominates when the number of interventions at an intermediate context is large.
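To get a feel for the bound, one can evaluate its right-hand side with the hidden constant set to $1$. The sketch below does this; the helper names are hypothetical, and the values of `p_plus` and `N` in the example are placeholders rather than numbers reported in the paper. The bound only applies for $T \ge T_0$, which the code does not check.

```python
import math

def regret_bound(lam, m0, p_plus, T, N):
    """Right-hand side of Theorem 1 with the hidden constant set to 1 (illustrative only)."""
    return math.sqrt(max(lam / T, m0 / (T * p_plus)) * math.log(N * T))

def dominant_term(lam, m0, p_plus):
    """Which of the two terms inside the max is larger."""
    return "lambda" if lam >= m0 / p_plus else "m0/p_plus"

# lam and m0 as in Figure 3a; p_plus = 0.5 and N = 51 are placeholder values.
short_run = regret_bound(lam=50, m0=2, p_plus=0.5, T=1000, N=51)
long_run = regret_bound(lam=50, m0=2, p_plus=0.5, T=25000, N=51)
which = dominant_term(lam=50, m0=2, p_plus=0.5)
```

Up to the logarithmic factor, both terms decay as $1/\sqrt{T}$, so which term dominates depends only on the comparison between $\lambda$ and $m_0/p_+$.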

4 Analysis of the Lower Bound

Since ConvExplore solves an optimization problem, it is not a priori clear whether a different algorithm could achieve a better regret guarantee than that of Theorem 1. In this section, we show that, for a large class of instances, the regret guarantee we provide is indeed optimal. We provide a lower bound on regret for a family of instances. For any number of contexts $k$, we show that there exist transition matrices $P$ and reward distributions ($\mathbb{E}[R_i \mid a]$) such that the regret achieved by ConvExplore (Theorem 1) is tight, up to log factors.

Theorem 2.

For any $q^i_j$ corresponding to causal variables at contexts $i \in [k]$, there exists a transition matrix $P$, and probabilities $q^0_j$ corresponding to causal variables $\{X^0_j\}_{j \in [n]}$, and reward distributions, such that the simple regret achieved by any algorithm is

$$\mathrm{Regret}_T \in \Omega\left(\sqrt{\frac{\lambda}{T}}\right)$$

We provide the details of the proof of Theorem 2 in Section F in the supplementary material.
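Placing the two theorems side by side makes the "tight up to log factors" claim explicit. The display below is our own restatement, with $C_1, C_2 > 0$ standing for the unspecified constants hidden in the $\mathcal{O}$ and $\Omega$ notation, and it covers the regime in which the $\lambda$ term dominates in Theorem 1.

```latex
\[
\text{If } \lambda \ge \frac{m_0}{p_+}: \qquad
\underbrace{C_2 \sqrt{\frac{\lambda}{T}}}_{\text{Theorem 2 (some instance, any algorithm)}}
\;\le\; \mathrm{Regret}_T \;\le\;
\underbrace{C_1 \sqrt{\frac{\lambda}{T}\,\log(NT)}}_{\text{Theorem 1 (ConvExplore)}},
\qquad \text{a gap of } \frac{C_1}{C_2}\sqrt{\log(NT)}.
\]
```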

5 Experiments

We first list a few baseline algorithms that we compare ConvExplore with. This is followed by a complete description of our experimental setup. Finally, we present and discuss our main results.

Uniform Exploration: This algorithm uniformly explores the interventions in the instance. It first performs all the atomic interventions $a \in \mathcal{A}_0$ at the start state $0$ in a round-robin manner. On transitioning to any context $i \in [k]$, it performs atomic interventions $b \in \mathcal{A}_i$ in a round-robin manner. UnifExplore achieves a regret upper bounded by $\tilde{\mathcal{O}}(\sqrt{nk/T})$, which is also the optimal lower bound for non-causal algorithms. Hence it serves as a good comparison, as it achieves an optimal non-causal simple regret. We plot the comparison with this non-causal regret-optimal exploration in Figure 3. We plot the regret with respect to (A) the number of rounds of exploration and (B) the $\lambda$ values of our instance. Notice that at extremely high $\lambda$ values ConvExplore does not perform well, as such an instance does not particularly benefit from the causal structure. Even so, with further tuning of constants in our algorithm, we should achieve a performance similar to UnifExplore.
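The round-robin bookkeeping behind Uniform Exploration can be sketched with one cursor for the start state and one independent cursor per intermediate context. This is a minimal illustration; the helper names are hypothetical and are not taken from the released code.

```python
from itertools import count

def make_round_robin(num_actions):
    """Return a function that cycles through 0, 1, ..., num_actions - 1."""
    counter = count()
    return lambda: next(counter) % num_actions

def unif_explore_schedulers(num_start_actions, num_context_actions):
    """One round-robin cursor for context 0 and one per intermediate context."""
    start = make_round_robin(num_start_actions)
    per_context = [make_round_robin(n_i) for n_i in num_context_actions]
    return start, per_context

start, per_context = unif_explore_schedulers(3, [2, 2])
start_sequence = [start() for _ in range(5)]
# A context's cursor only advances when that context is actually visited.
second_context = [per_context[1](), per_context[1]()]
first_context = per_context[0]()
```

Because each context keeps its own cursor, rarely visited contexts still cycle through their interventions in order, but they receive fewer samples overall, which is the inefficiency that a non-uniform frequency vector at context 0 is meant to address.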

Figure 3: We plot the simple regret under ConvExplore and UnifExplore. The figure on the left (3a) plots expected simple regret vs. time, for the setup $n = 25$, $k = 25$, $\lambda = 50$, $\varepsilon = 0.3$ and $m = 2$ for all contexts. The figure on the right (3b) plots expected simple regret against $\lambda$. It was performed with the parameters $T = 25000$, $k = 25$, $m_0 = 2$ and $\varepsilon = 0.3$.

Other Baselines: We now consider several other baselines for comparison that have been used in the literature. Primary amongst these are: (1) UCB at the start state as well as the intermediate contexts; (2) Thompson sampling at the start state as well as the intermediate contexts; (3) round-robin at the start state and UCB at the intermediate contexts; (4) round-robin at the start state and Thompson sampling at the intermediate contexts; and (5) UnifExplore, which is round-robin at both the start state and the intermediate contexts.

Setup: We consider $k = 25$ intermediate contexts and a causal graph with $n = 25$ variables ($2n + 1 = 51$ interventions) at each context. The rewards are distributed Bernoulli($0.5 + \varepsilon$) for intervention $X^1_1 = 1$ and Bernoulli($0.5$) otherwise, where $\varepsilon = 0.3$ in the experiments. We set $m_i = m \ \forall i \in [k]$. As in experiments in prior work, we set $q^i_j = 0$ for $j \le m_i$ and $0.5$ otherwise. Let $k = n$ here. At state $0$, on taking action $a = do()$, we transition uniformly to one of the intermediate contexts. On taking action $do(X^0_i = 1)$, we transition with probability $2/k$ to context $i$ and probability $1/k - 1/(k(k-1))$ to any of the other $k - 1$ contexts.
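The transition rows described in the setup can be written down directly, which also gives a quick check that each row is a valid probability distribution. The sketch below builds only the rows that the text specifies, namely $do()$ and $do(X^0_i = 1)$; the rows for $do(X^0_i = 0)$ are not described here and are therefore omitted. The function name is hypothetical.

```python
def experiment_transition_rows(k):
    """Rows of P described in the setup: row 0 is do(), row i (1..k) is do(X_i^0 = 1).

    Columns index the intermediate contexts 1..k (stored 0-based).
    """
    rows = [[1.0 / k] * k]                     # do(): uniform over contexts
    other = 1.0 / k - 1.0 / (k * (k - 1))      # probability of each non-favoured context
    for i in range(k):
        row = [other] * k
        row[i] = 2.0 / k                       # favoured context i
        rows.append(row)
    return rows

P_exp = experiment_transition_rows(25)
row_sums = [sum(row) for row in P_exp]
```

Each $do(X^0_i = 1)$ row sums to $2/k + (k-1)\left(1/k - 1/(k(k-1))\right) = 1$, so intervening doubles the chance of reaching context $i$ while leaving the other contexts nearly uniform.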

We perform two experiments in this setting. In the first one, we run ConvExplore and UnifExplore for time horizon $T \in \{1000, \dots, 25000\}$. In the second experiment, we run ConvExplore and UnifExplore for a fixed time horizon $T = 25000$ with $\lambda$ varying in the set $\{50, 75, \dots, 625\}$. To vary $\lambda$, we vary $m_i$ for the intermediate contexts in the set $\{2, 3, \ldots, 25\}$. We average the regret over $10000$ runs for each setting. We use CVXPY (Diamond & Boyd, 2016) to solve the convex program at Step 4 in ConvExplore. We release our code in its entirety in our anonymized GitHub project repository, for the community to use and improve.

Figure 4: We plot various baselines for two metrics of interest (1) Probability of the algorithm finding the best interventions and (2) Simple regret. These plots illustrate how these metrics vary with the exploration budget.
Figure 5: We plot the variation of probability of finding the best intervention and simple regret with the number of contexts. Notice the outperformance of ConvExplore vs. the other baselines.

Results of comparison with UnifExplore: In Figure 3a, we compare the expected simple regret of ConvExplore vs. UnifExplore. Our plots indicate that ConvExplore outperforms UnifExplore and that its regret falls rapidly as $T$ increases. In Figure 3b, we plot the expected simple regret against $\lambda$ for ConvExplore and UnifExplore, as obtained in Experiment 2, and empirically validate the relationship that was proved in Theorem 1.

Figure 6: We plot the variation of probability of finding the best intervention and simple regret with the $\lambda$ value. Notice that ConvExplore is the only algorithm that is causal-aware and hence varies with $\lambda$.

Results of comparison with other baselines: We find that ConvExplore significantly outperforms baselines other than UnifExplore. Specifically, Thompson sampling and UCB are not well tuned to the exploration problem, and hence perform poorly in both metrics: (1) simple regret and (2) probability of finding the best intervention. A mixture of round-robin at the start state with these alternatives at the intermediate contexts also performs poorly relative to ConvExplore for this particular exploration problem. In Figure 4 we plot the metrics against the exploration budget. In Figure 5 we plot the metrics of interest against the number of contexts at the intermediate stage. Finally, in Figure 6, we plot the simple regret as well as the probability of finding the best intervention against our parameter $\lambda$, while keeping the number of intermediate contexts the same. The results of these experiments and full details can be found here.

6 Conclusions

We studied extensions of the causal contextual bandits framework to include adaptive context choice. This is an important problem in practice and the solutions therein have immediate practical applications. The setting of stochastic transition to a context accounted for non-trivial extensions from Subramanian & Ravindran (2022), who studied targeted interventions. We developed a Convex Exploration algorithm for minimizing simple regret under this setting. Furthermore, while Maiti et al. (2022) studied the simple causal bandit setting with unobserved confounders, our work addresses causal contextual bandits with adaptive contexts, under the same constraint of allowing unobserved confounders (assuming identifiability). We identified an instance-dependent parameter $\lambda$, and proved that the regret of this algorithm is $\tilde{\mathcal{O}}\left(\sqrt{\frac{1}{T} \max\left\{\lambda, \frac{m_0}{p_+}\right\}}\right)$. The current work also established that, for certain families of instances, this upper bound is essentially tight. Finally, we showed through experiments that our algorithm performs better than uniform exploration in a range of settings. We believe our method of converting the exploration problem in the causal contextual bandit setting into a convex optimization problem is novel, and may have implications outside the causal setting as well.

Possible generalizations of this work include extensions to non-binary reward settings. Another natural extension would be to derive bounds for L-layered MDPs, extending from the adaptive contextual bandit setting we consider. It would be interesting to see whether that problem reduces to convex exploration as well. Finally, extending convex exploration methods from this paper to other more general simple regret problems may also be a promising avenue for future research.

7 Acknowledgements

Siddharth Barman gratefully acknowledges the support of the Walmart Center for Tech Excellence (CSR WMGT-23-0001) and a SERB Core research grant (CRG/2021/006165).

References

  • Acharya et al. (2018) Jayadev Acharya, Arnab Bhattacharyya, Constantinos Daskalakis, and Saravanan Kandasamy. Learning and testing causal models with interventions. Advances in Neural Information Processing Systems, 31, 2018.
  • Agrawal & Goyal (2012) Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem. In Conference on learning theory, pp.  39–1. JMLR Workshop and Conference Proceedings, 2012.
  • Ali et al. (2005) R Ayesha Ali, Thomas S Richardson, Peter Spirtes, and Jiji Zhang. Towards characterizing markov equivalence classes for directed acyclic graphs with latent variables. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pp.  10–17, 2005.
  • Audibert et al. (2010) Jean-Yves Audibert, Sébastien Bubeck, and Rémi Munos. Best arm identification in multi-armed bandits. In COLT, pp.  41–53, 2010.
  • Auer et al. (1995) P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, FOCS ’95, pp.  322, USA, 1995. IEEE Computer Society. ISBN 0818671831.
  • Auer & Ortner (2010) Peter Auer and Ronald Ortner. Ucb revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65, 2010.
  • Auer et al. (2002) Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Mach. Learn., 47(2–3):235–256, may 2002. ISSN 0885-6125. doi: 10.1023/A:1013689704352. URL https://doi.org/10.1023/A:1013689704352.
  • Azar et al. (2017) Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning, 2017.
  • Bouneffouf et al. (2020) Djallel Bouneffouf, Irina Rish, and Charu Aggarwal. Survey on applications of multi-armed and contextual bandits. In 2020 IEEE Congress on Evolutionary Computation (CEC), pp.  1–8. IEEE, 2020.
  • Bubeck et al. (2009) Sebastien Bubeck, Remi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. In Algorithmic Learning Theory: 20th International Conference, ALT 2009, Porto, Portugal, October 3-5, 2009. Proceedings 20, pp.  23–37. Springer, 2009.
  • Bubeck et al. (2012) Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
  • Canonne et al. (2017) Clément L Canonne, Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. Testing bayesian networks. In Conference on Learning Theory, pp.  370–448. PMLR, 2017.
  • Cover & Thomas (2006) Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, USA, 2006. ISBN 0471241954.
  • Daskalakis et al. (2019) Constantinos Daskalakis, Nishanth Dikkala, and Gautam Kamath. Testing ising models. IEEE Transactions on Information Theory, 65(11):6829–6852, 2019.
  • Devroye (1983) Luc Devroye. The equivalence of weak, strong and complete convergence in l1 for kernel density estimates. The Annals of Statistics, 11(3):896–904, 1983. ISSN 00905364. URL http://www.jstor.org/stable/2240651.
  • Diamond & Boyd (2016) Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.
  • Eberhardt et al. (2005) Frederick Eberhardt, Clark Glymour, and Richard Scheines. On the number of experiments sufficient and in the worst case necessary to identify all causal relations among n variables. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pp.  178–184, 2005.
  • Even-Dar et al. (2006) Eyal Even-Dar, Shie Mannor, Yishay Mansour, and Sridhar Mahadevan. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of machine learning research, 7(6), 2006.
  • Feng & Chen (2023) Shi Feng and Wei Chen. Combinatorial causal bandits. Proceedings of the AAAI Conference on Artificial Intelligence, 37(6):7550–7558, Jun. 2023. doi: 10.1609/aaai.v37i6.25917. URL https://ojs.aaai.org/index.php/AAAI/article/view/25917.
  • Hauser & Bühlmann (2014) Alain Hauser and Peter Bühlmann. Two optimal strategies for active learning of causal models from interventional data. International Journal of Approximate Reasoning, 55(4):926–939, 2014.
  • Jamshidi et al. (2024) Fateme Jamshidi, Jalal Etesami, and Negar Kiyavash. Confounded budgeted causal bandits. arXiv preprint arXiv:2401.07578, 2024.
  • Kang & Tian (2006) Changsung Kang and Jin Tian. Inequality constraints in causal models with hidden variables. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, UAI’06, pp.  233–240, Arlington, Virginia, USA, 2006. AUAI Press. ISBN 0974903922.
  • Kocaoglu et al. (2017a) Murat Kocaoglu, Alex Dimakis, and Sriram Vishwanath. Cost-optimal learning of causal graphs. In International Conference on Machine Learning, pp.  1875–1884. PMLR, 2017a.
  • Kocaoglu et al. (2017b) Murat Kocaoglu, Karthikeyan Shanmugam, and Elias Bareinboim. Experimental design for learning causal graphs with latent variables. Advances in Neural Information Processing Systems, 30, 2017b.
  • Kuleshov & Precup (2014) Volodymyr Kuleshov and Doina Precup. Algorithms for multi-armed bandit problems. arXiv preprint arXiv:1402.6028, 2014.
  • Lattimore et al. (2016) Finnian Lattimore, Tor Lattimore, and Mark D. Reid. Causal bandits: Learning good interventions via causal inference. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pp.  1189–1197, Red Hook, NY, USA, 2016. Curran Associates Inc. ISBN 9781510838819.
  • Lattimore & Szepesvári (2020) Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020. doi: 10.1017/9781108571401.
  • Lee & Bareinboim (2018) Sanghack Lee and Elias Bareinboim. Structural causal bandits: Where to intervene? In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/c0a271bc0ecb776a094786474322cb82-Paper.pdf.
  • Lee & Bareinboim (2019) Sanghack Lee and Elias Bareinboim. Structural causal bandits with non-manipulable variables. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):4164–4172, Jul. 2019. doi: 10.1609/aaai.v33i01.33014164. URL https://ojs.aaai.org/index.php/AAAI/article/view/4320.
  • Lu et al. (2020) Yangyi Lu, Amirhossein Meisami, Ambuj Tewari, and William Yan. Regret analysis of bandit problems with causal background knowledge. In Conference on Uncertainty in Artificial Intelligence, pp.  141–150. PMLR, 2020.
  • Lu et al. (2021) Yangyi Lu, Amirhossein Meisami, and Ambuj Tewari. Causal bandits with unknown graph structure. Advances in Neural Information Processing Systems, 34:24817–24828, 2021.
  • Lu et al. (2022) Yangyi Lu, Amirhossein Meisami, and Ambuj Tewari. Efficient reinforcement learning with prior causal knowledge. In Conference on Causal Learning and Reasoning, pp.  526–541. PMLR, 2022.
  • Maiti et al. (2022) Aurghya Maiti, Vineet Nair, and Gaurav Sinha. A causal bandit approach to learning good atomic interventions in presence of unobserved confounders. In Uncertainty in Artificial Intelligence, pp.  1328–1338. PMLR, 2022.
  • Malek et al. (2023) Alan Malek, Virginia Aglietti, and Silvia Chiappa. Additive causal bandits with unknown graph. In International Conference on Machine Learning, pp.  23574–23589. PMLR, 2023.
  • Mannor & Tsitsiklis (2004) Shie Mannor and John N Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5(Jun):623–648, 2004.
  • Nair et al. (2021) Vineet Nair, Vishakha Patil, and Gaurav Sinha. Budgeted and non-budgeted causal bandits. In Arindam Banerjee and Kenji Fukumizu (eds.), The 24th International Conference on Artificial Intelligence and Statistics, AISTATS 2021, April 13-15, 2021, Virtual Event, volume 130 of Proceedings of Machine Learning Research, pp.  2017–2025. PMLR, 2021. URL http://proceedings.mlr.press/v130/nair21a.html.
  • Orabona et al. (2012) Francesco Orabona, Nicolo Cesa-Bianchi, and Claudio Gentile. Beyond logarithmic bounds in online learning. In Artificial intelligence and statistics, pp.  823–831. PMLR, 2012.
  • Osband & Van Roy (2016) Ian Osband and Benjamin Van Roy. On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732, 2016.
  • Pearl & Verma (1995) Judea Pearl and Thomas S Verma. A theory of inferred causation. In Studies in Logic and the Foundations of Mathematics, volume 134, pp.  789–811. Elsevier, 1995.
  • Peters et al. (2017) Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. The MIT Press, 2017. ISBN 0262037319.
  • Sen et al. (2017a) Rajat Sen, Karthikeyan Shanmugam, Alexandros G Dimakis, and Sanjay Shakkottai. Identifying best interventions through online importance sampling. In International Conference on Machine Learning, pp.  3057–3066. PMLR, 2017a.
  • Sen et al. (2017b) Rajat Sen, Karthikeyan Shanmugam, Murat Kocaoglu, Alex Dimakis, and Sanjay Shakkottai. Contextual bandits with latent confounders: An nmf approach. In Artificial Intelligence and Statistics, pp.  518–527. PMLR, 2017b.
  • Shanmugam et al. (2015) Karthikeyan Shanmugam, Murat Kocaoglu, Alexandros G Dimakis, and Sriram Vishwanath. Learning causal graphs with small interventions. Advances in Neural Information Processing Systems, 28, 2015.
  • Slivkins et al. (2019) Aleksandrs Slivkins et al. Introduction to multi-armed bandits. Foundations and Trends® in Machine Learning, 12(1-2):1–286, 2019.
  • Spirtes et al. (2000) Peter Spirtes, Clark N Glymour, Richard Scheines, and David Heckerman. Causation, prediction, and search. MIT press, 2000.
  • Subramanian & Ravindran (2022) Chandrasekar Subramanian and Balaraman Ravindran. Causal contextual bandits with targeted interventions. In International Conference on Learning Representations, 2022.
  • Tian & Pearl (2002) Jin Tian and Judea Pearl. On the testable implications of causal models with hidden variables. In Proceedings of the Eighteenth conference on Uncertainty in artificial intelligence, pp.  519–527, 2002.
  • Varici et al. (2022) Burak Varici, Karthikeyan Shanmugam, Prasanna Sattigeri, and Ali Tajer. Causal bandits for linear structural equation models. arXiv preprint arXiv:2208.12764, 2022.
  • Wei et al. (2024) Lai Wei, Muhammad Qasim Elahi, Mahsa Ghasemi, and Murat Kocaoglu. Approximate allocation matching for structural causal bandits with unobserved confounders. Advances in Neural Information Processing Systems, 36, 2024.
  • Xiong & Chen (2023) Nuoya Xiong and Wei Chen. Combinatorial pure exploration of causal bandits. In International Conference on Learning Representations, 2023.
  • Yabe et al. (2018) Akihiro Yabe, Daisuke Hatano, Hanna Sumita, Shinji Ito, Naonori Kakimura, Takuro Fukunaga, and Ken-ichi Kawarabayashi. Causal bandits with propagating inference. In International Conference on Machine Learning, pp.  5512–5520. PMLR, 2018.
  • Yan et al. (2024) Zirui Yan, Dennis Wei, Dmitriy Katz-Rogozhnikov, Prasanna Sattigeri, and Ali Tajer. Causal bandits with general causal models and interventions. arXiv preprint arXiv:2403.00233, 2024.
  • Zhang (2008) Jiji Zhang. On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artificial Intelligence, 172(16-17):1873–1896, 2008.
  • Zhang (2020) Junzhe Zhang. Designing optimal dynamic treatment regimes: A causal reinforcement learning approach. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp.  11012–11022. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/zhang20a.html.

Appendix A Related Work

In our work, we draw from prior literature from causality as well as from multi-armed bandits. We will briefly cover these two in the following section.

A.1 Multi-armed bandits:

The stochastic Multi-Armed Bandit (MAB) setup is a standard model for studying the exploration-exploitation trade-off in sequential decision making problems (Kuleshov & Precup, 2014; Bubeck et al., 2012). Such trade-offs arise in several modern applications, such as ad placement, website optimization, recommendation systems, and packet routing (Bouneffouf et al., 2020) and are thus a central part of the theory relating to online learning (Slivkins et al., 2019; Lattimore & Szepesvári, 2020).

Traditional performance measures for MAB algorithms have focused on cumulative regret (Auer et al., 2002; Agrawal & Goyal, 2012; Auer & Ortner, 2010), as well as best-arm identification under the fixed confidence (Even-Dar et al., 2006) and fixed budget (Audibert et al., 2010) settings. Another variant of regret that has been considered is the minimax regret (Azar et al., 2017), which focuses on the worst case over all possible environments. In some settings, however, one may be interested in optimizing the exploration phase itself. As a metric for such pure exploration in MABs, simple regret has been proposed as a natural performance criterion (Bubeck et al., 2009). In this setting, the learner is allowed some period of exploration, after which it has to choose an arm. The simple regret is then evaluated as the difference between the average reward of the best arm and the average reward of the learner’s recommendation. We focus on simple regret in this work.
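To make the definition of simple regret concrete, the following minimal sketch simulates a Bernoulli bandit, explores the arms in round-robin order, and reports the simple regret of the recommended arm. The function name `simple_regret_uniform`, the round-robin exploration rule, and the Bernoulli reward model are illustrative assumptions and are not the exploration strategy analysed in this paper.

```python
import random


def simple_regret_uniform(means, budget, seed=0):
    """Explore arms round-robin for `budget` pulls, recommend the empirically
    best arm, and return the simple regret of that recommendation."""
    rng = random.Random(seed)
    n = len(means)
    pulls = [0] * n
    totals = [0.0] * n
    for t in range(budget):
        arm = t % n  # uniform (round-robin) exploration, an assumption
        reward = 1.0 if rng.random() < means[arm] else 0.0  # Bernoulli reward
        pulls[arm] += 1
        totals[arm] += reward
    # Empirical mean of each arm; arms that were never pulled default to 0.
    estimates = [totals[a] / pulls[a] if pulls[a] else 0.0 for a in range(n)]
    recommended = max(range(n), key=lambda a: estimates[a])
    # Simple regret: best true mean minus true mean of the recommendation.
    return max(means) - means[recommended]
```

The returned value depends only on which arm is recommended at the end, not on the rewards collected during exploration, which is the feature that separates simple regret from cumulative regret.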

Each of these performance metrics comes with its own lower bounds (Orabona et al., 2012; Osband & Van Roy, 2016; Bubeck et al., 2012), which are naturally the benchmarks for any proposed algorithm. The lower bound on simple regret is known to be $\mathcal{O}(\sqrt{n/T})$ for a stochastic multi-armed bandit problem with $n$ arms. This bound is obtained from the lower bound for pure exploration provided by Mannor & Tsitsiklis (2004).

Note that a naive approach to the causal bandit problem, which simply treats an intervention on each of the exponentially many combinations of nodes as an arm, may thus incur regret that is exponential in the number of nodes. We now review some of the literature from causality, which helps in addressing the causal aspects of the problem.

A.2 Causality:

There are three broad threads in causality related to our work. These are causal graph learning, causal testing and causal bandits. We address relevant works in these areas below.

Learning Causal Graphs: Tian & Pearl (2002) laid the grounds for analysing functional constraints among the distributions of observed variables in a causal Bayesian network. Similarly, Kang & Tian (2006) derive such functional constraints over interventional distributions. These two seminal works led to great interest in the problem of learning causal graphs.

There have been several studies that provide algorithms to recover causal graphs from the conditional independence relations in observational data (Pearl & Verma, 1995; Spirtes et al., 2000; Ali et al., 2005; Zhang, 2008). Subsequent work considered the setting where both observational and interventional data are available (Eberhardt et al., 2005; Hauser & Bühlmann, 2014). Kocaoglu et al. (2017a) extend the causal graph learning problem to a budgeted setting. Shanmugam et al. (2015) use interventions on sets of small size to learn the causal structure. Kocaoglu et al. (2017b) provide an efficient randomized algorithm to learn a causal graph with confounding variables.

Testing over Bayesian networks: Given sample access to an unknown Bayesian Network (Canonne et al., 2017), or Ising model (Daskalakis et al., 2019), one may wish to decide whether an unknown model is equal to a known fixed model, and analyse the sample complexity of this hypothesis test. Acharya et al. (2018) address this question by introducing the concept of covering interventions. These covering interventions allow us to understand the behaviour of multiple interventions (that are covered) simultaneously. We utilize the concept of covering interventions from Acharya et al. (2018) towards our question of finding the optimal intervention in a causal bandit. The area of reinforcement learning over causal bandits has also been studied in Zhang (2020).

Apart from these areas in causality, our primary problem of causal bandits has been addressed by Lattimore et al. (2016); Maiti et al. (2022); Sen et al. (2017a); Lu et al. (2020); Nair et al. (2021); Sen et al. (2017b); Lu et al. (2021; 2022); Varici et al. (2022); Xiong & Chen (2023). We detail these in the main Related Works Section 1.2.

Appendix B Algorithms in Detail

In this section, we outline the three algorithms that are used as helpers in ConvExplore. The first, Algorithm 2, is used to estimate the transition probabilities out of context $0$ on taking various actions.

Algorithm 2 Estimate Transition Probabilities
1: Input: Time budget $T'$
2: For time $t \leftarrow \{1, \dots, \frac{T'}{2}\}$ do
3:   Perform $do()$ at context $0$. Transition to $i \in [k]$
4:   Count number of times context $i \in [k]$ is observed
5:   Update $\widehat{q}_j^0 = \mathbb{P}\{X_j^0 = 1\}$
6: end
7: Using the $\widehat{q}_j^0$s, estimate $m_0$ and the set $\mathcal{A}_{m_0}$. Estimate $\widehat{P}_{(a,i)} = \mathbb{P}[i \mid a]$ $\forall a \in \mathcal{A}_{m_0}^c$ and $i \in [k]$
8: For intervention $a \in \mathcal{A}_{m_0}$ at context $0$
9:   For time $t \leftarrow \{1, \dots, \frac{T'}{2\lvert\mathcal{A}_{m_0}\rvert}\}$
10:    Perform $a \in \mathcal{A}_{m_0}$ and transition to some $i \in [k]$
11:    Count number of times context $i$ is observed
12:  end
13: end
14: Estimate $\widehat{P}_{(a,i)} = \mathbb{P}[i \mid a]$ for each $a \in \mathcal{A}_{m_0}$ and contexts $i \in [k]$
15: return Estimated matrix $\widehat{P} = \left[\widehat{P}_{(a,i)}\right]_{i \in [k], a \in \mathcal{A}_0}$
Footnote 1: In the first half of time $T'/2$, we perform $do()$ at State $0$.
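The counting steps of Algorithm 2 amount to forming empirical frequencies of the observed contexts for each action. The sketch below illustrates only that counting step. The function name `estimate_transitions`, the `(action, context)` sample format, and the uniform fallback for unsampled actions are assumptions made for illustration, and the sketch does not reproduce how observational $do()$ rounds are reused for the interventions in $\mathcal{A}_{m_0}^c$.

```python
from collections import Counter


def estimate_transitions(samples, actions, k):
    """samples: iterable of (action, context) pairs observed from context 0,
    with contexts labelled 1..k. Returns {action: [P_hat(a,1), ..., P_hat(a,k)]}
    computed as empirical frequencies."""
    counts = {a: Counter() for a in actions}
    for a, i in samples:
        counts[a][i] += 1
    p_hat = {}
    for a in actions:
        total = sum(counts[a].values())
        # Assumption: fall back to the uniform distribution when an action
        # was never sampled, so that each row is still a distribution.
        p_hat[a] = [counts[a][i] / total if total else 1.0 / k
                    for i in range(1, k + 1)]
    return p_hat
```

For example, `estimate_transitions([("a", 1), ("a", 1), ("a", 2)], ["a"], 2)` assigns two thirds of the mass to context 1 and one third to context 2 for action `"a"`.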

Next we estimate the causal parameters at all contexts $i \in [k]$ through Algorithm 3. Then we will use Algorithm 4 to estimate the rewards on various interventions at the intermediate contexts.

For estimating the causal parameters, we use a variant of SRM-ALG from Maiti et al. (2022), which estimates the causal observational threshold $m_i$, under the setting of unobserved confounders and identifiability. We note that even in the presence of general causal graphs with hidden variables, SRM-ALG is able to efficiently estimate the rewards of all the arms simultaneously using the observational arm pulls. As mentioned in Section 3 of Maiti et al. (2022), the challenge is to identify the optimal number of arms with bad estimates during the initial phase of the algorithm, such that these arms can be intervened upon at the later phase. The $q_i(x)$ parameter is the minimum conditional probability of $X = x$, given different configurations of the parents of $X$. Once we have these estimates, the remaining algorithm can proceed as per usual.

Algorithm 3 Estimate Causal Parameters
1: Input: Frequency vector $\tilde{f}$ and time budget $T'$
2: Update $f(a) \leftarrow \frac{1}{2}\left(\tilde{f}(a) + \frac{1}{\lvert\mathcal{A}_0\rvert}\right)$ $\forall a \in \mathcal{A}_0$
3: For intervention $a \in \mathcal{A}_0$
4:   For time $t \leftarrow \{1, \dots, T' \cdot f(a)\}$
5:     Perform $a \in \mathcal{A}_0$ and transition to some $i \in [k]$.
6:     At context $i$, perform $do()$ and observe the $X^i_j$s
7:     Update $\widehat{q}_j^i = \min_{\text{Parents}(X_j^i),\, x \in \{0,1\}} \mathbb{P}\left\{X_j^i = x \mid \text{Parents}(X_j^i)\right\}$
8:   end
9: end
10: Using the $\widehat{q}_j^i$s, estimate $\widehat{m}_i$ values for each context $i \in [k]$
11: return $\hat{M}$, the diagonal matrix of the $\widehat{m}_i$ values
Footnote 1: We choose actions $a \in \mathcal{A}_0$ such that we visit the contexts $i \in [k]$ approximately equally, in expectation.
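Step 10 of Algorithm 3 turns the estimated $\widehat{q}_j^i$ values into a threshold $\widehat{m}_i$, but this appendix does not spell out the rule. The sketch below assumes the threshold definition used by Lattimore et al. (2016): the smallest $\tau$ such that at most $\tau$ of the $q$ values lie below $1/\tau$. The function name `observational_threshold` is hypothetical, and the variant of SRM-ALG used in this paper may differ in its details.

```python
def observational_threshold(q):
    """Return (tau, rare), where tau is the smallest value in {2, ..., n} such
    that at most tau entries of q are below 1/tau, and rare lists the indices
    of those entries. Assumed rule, following Lattimore et al. (2016)."""
    n = len(q)
    for tau in range(2, n + 1):
        # Variables whose least likely value is seen with probability < 1/tau.
        rare = [j for j, qj in enumerate(q) if qj < 1.0 / tau]
        if len(rare) <= tau:
            return tau, rare
    # Only reached when n < 2; treat every variable as rarely observed.
    return n, list(range(n))
```

Under this assumed rule, a context in which every variable is well balanced yields a small threshold, whereas a context in which every variable is highly skewed yields a threshold equal to the number of variables.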

Note that in Algorithm 4 there are two phases. In the first phase, we carry out estimates for interventions that have high probability of being observed on the $do()$ intervention. In the second phase, we specifically perform interventions which have not been observed often enough. This is similar to Algorithm 2 where we carry out the two phases of interventions at context $0$.

Algorithm 4 Estimate Rewards
1: Input: Optimal frequency $f^*$, min-max frequency $\tilde{f}$, and time budget $T'$
2: Set $f(a) \leftarrow \frac{1}{3}\left(f^*(a) + \tilde{f}(a) + \frac{1}{\lvert\mathcal{A}_0\rvert}\right)$ $\forall a \in \mathcal{A}_0$
3: For intervention $a \in \mathcal{A}_0$ at context $0$
4:   For time $t \leftarrow \{1, \dots, f(a) \cdot T'/2\}$
5:     Perform $a \in \mathcal{A}_0$. Transition to some $i \in [k]$. Perform $do()$ at context $i \in [k]$.
6:     Observe variables $X^i_j$ and rewards $R_i$.
7:   end
8: end
9: Find the set $\mathcal{A}_{m_i}$ $\forall i \in [k]$ using the $q^i_j$ estimates.
10: Estimate mean reward $\widehat{\mathcal{R}}_{(b,i)} = \mathbb{E}\left[R_i \mid b\right]$ for each $b \in \mathcal{A}_{m_i}^c$
11: For intervention $a \in \mathcal{A}_0$ at context $0$
12:   For time $t \leftarrow \{1, \dots, f(a) \cdot T'/2\}$
13:     Perform $a \in \mathcal{A}_0$ and transition to some $i \in [k]$.
14:     Iteratively perform $b \in \mathcal{A}_{m_i}$. Observe $R_i$
15:   end
16: end
17: Estimate mean reward $\widehat{\mathcal{R}}_{(b,i)} = \mathbb{E}\left[R_i \mid b\right]$ for each $b \in \mathcal{A}_{m_i}$
18: return $\widehat{\mathcal{R}} = \left[\widehat{\mathcal{R}}_{(b,i)}\right]_{i \in [k], b \in \mathcal{A}_i}$
Footnote 1: We perform the interventions in the ratio of $f^*$, which is the optimum given the $m_i$ values at the various contexts.
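Step 2 of Algorithms 3 and 4 averages the supplied frequency vectors with the uniform distribution over $\mathcal{A}_0$, so every action at context $0$ keeps a share of at least $1/(2\lvert\mathcal{A}_0\rvert)$ or $1/(3\lvert\mathcal{A}_0\rvert)$ of the rounds, respectively. A minimal sketch of this mixing step follows. The helper names `mix_frequencies` and `pulls_per_action`, and the rounding down of $f(a) \cdot T'$ to an integer, are assumptions for illustration.

```python
def mix_frequencies(*freqs):
    """Average the given frequency vectors (all over the same actions) with
    the uniform distribution, e.g. f = (f_star + f_tilde + uniform) / 3."""
    n = len(freqs[0])
    uniform = [1.0 / n] * n
    parts = list(freqs) + [uniform]
    return [sum(p[a] for p in parts) / len(parts) for a in range(n)]


def pulls_per_action(f, budget):
    """Rounds allotted to each action, assuming f(a) * budget is rounded down."""
    return [int(fa * budget) for fa in f]
```

Calling `mix_frequencies(f_tilde)` gives the two-way average of Algorithm 3, and `mix_frequencies(f_star, f_tilde)` gives the three-way average of Algorithm 4.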

Appendix C Proof of Theorem 1

In this section, we restate Theorem 1 and provide its proof, along with all the lemmas that are used in the proof.

Theorem.

Given number of rounds $T \geq T_0$ and $\lambda$ as in equation (3), ConvExplore achieves regret

$${\rm Regret}_T \in \mathcal{O}\left(\sqrt{\max\left\{\frac{\lambda}{T}, \frac{m_0}{T p_+}\right\} \log\left(NT\right)}\right)$$
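To get a feel for how the bound scales, the rate inside the $\mathcal{O}(\cdot)$ can be evaluated numerically. The sketch below drops the unspecified constant factor, so it only shows relative scaling and not an actual regret value. The function name `regret_rate` and the parameter values in the usage note are illustrative assumptions.

```python
import math


def regret_rate(lam, m0, p_plus, N, T):
    """Evaluate sqrt(max(lam / T, m0 / (T * p_plus)) * log(N * T)),
    i.e. the rate in Theorem 1 with the constant factor omitted."""
    dominant = max(lam / T, m0 / (T * p_plus))
    return math.sqrt(dominant * math.log(N * T))
```

For instance, with `lam=10`, `m0=2` and `p_plus=0.5`, the first term dominates because $m_0/p_+ = 4 < \lambda$, and increasing `T` from `1000` to `100000` shrinks the rate.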

C.1 Proof of Theorem 1

To prove the theorem, we analyze the algorithm’s execution as falling under either the good event or the bad event, and bound the regret under each.

Definition 1.

We define five events, $E_1$ to $E_5$ (see Table 3), the intersection of which we call the good event $E$, i.e., $E := \bigcap_{i \in [5]} E_i$. Furthermore, we define the bad event $F := E^c$.

Table 3: Table enumerating Good Events
Event Condition Explanation
E1subscript𝐸1E_{1}italic_E start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT i=1k|P^(a,i)P(a,i)|p+3a𝒜0superscriptsubscript𝑖1𝑘subscript^𝑃𝑎𝑖subscript𝑃𝑎𝑖subscript𝑝3for-all𝑎subscript𝒜0\sum\limits_{i=1}^{k}\lvert\widehat{P}_{(a,i)}-P_{(a,i)}\rvert\leq\frac{p_{+}}% {3}\forall a\in\mathcal{A}_{0}∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT | over^ start_ARG italic_P end_ARG start_POSTSUBSCRIPT ( italic_a , italic_i ) end_POSTSUBSCRIPT - italic_P start_POSTSUBSCRIPT ( italic_a , italic_i ) end_POSTSUBSCRIPT | ≤ divide start_ARG italic_p start_POSTSUBSCRIPT + end_POSTSUBSCRIPT end_ARG start_ARG 3 end_ARG ∀ italic_a ∈ caligraphic_A start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT for every intervention a𝒜0𝑎subscript𝒜0a\in\mathcal{A}_{0}italic_a ∈ caligraphic_A start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, the empirical estimate of transition probability in each of Algorithms 2, 3 and 4 is good, up to an absolute factor of p+/3subscript𝑝3p_{+}/3italic_p start_POSTSUBSCRIPT + end_POSTSUBSCRIPT / 3
E2subscript𝐸2E_{2}italic_E start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT m^0[23m0,2m0]subscript^𝑚023subscript𝑚02subscript𝑚0\widehat{m}_{0}\in[\frac{2}{3}m_{0},2m_{0}]over^ start_ARG italic_m end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∈ [ divide start_ARG 2 end_ARG start_ARG 3 end_ARG italic_m start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , 2 italic_m start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ] our estimate for causal parameter m0subscript𝑚0m_{0}italic_m start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT for state 0 is relatively good in Algorithm 2.
E3subscript𝐸3E_{3}italic_E start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT m^i[23mi,2mi]i[k]subscript^𝑚𝑖23subscript𝑚𝑖2subscript𝑚𝑖for-all𝑖delimited-[]𝑘\widehat{m}_{i}\in[\frac{2}{3}m_{i},2m_{i}]\quad\forall i\in[k]over^ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ [ divide start_ARG 2 end_ARG start_ARG 3 end_ARG italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , 2 italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] ∀ italic_i ∈ [ italic_k ] our estimate for causal parameter misubscript𝑚𝑖m_{i}italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT for each context i[k]𝑖delimited-[]𝑘i\in[k]italic_i ∈ [ italic_k ] is relatively good in Algorithm 3.
E4subscript𝐸4E_{4}italic_E start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT i[k]|P^(a,i)P(a,i)|ζsubscript𝑖delimited-[]𝑘subscript^𝑃𝑎𝑖subscript𝑃𝑎𝑖𝜁\sum_{i\in[k]}\lvert\widehat{P}_{(a,i)}-P_{(a,i)}\rvert\leq\zeta∑ start_POSTSUBSCRIPT italic_i ∈ [ italic_k ] end_POSTSUBSCRIPT | over^ start_ARG italic_P end_ARG start_POSTSUBSCRIPT ( italic_a , italic_i ) end_POSTSUBSCRIPT - italic_P start_POSTSUBSCRIPT ( italic_a , italic_i ) end_POSTSUBSCRIPT | ≤ italic_ζ, a𝒜0for-all𝑎subscript𝒜0\forall a\in\mathcal{A}_{0}∀ italic_a ∈ caligraphic_A start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT The error in estimated transition probability in Algorithm 2 sums to less than ζ𝜁\zetaitalic_ζ where ζ:=150m0Tp+log(3Tk)assign𝜁150subscript𝑚0𝑇subscript𝑝3𝑇𝑘\zeta:=\sqrt{\frac{150m_{0}}{Tp_{+}}\log\left(\frac{3T}{k}\right)}italic_ζ := square-root start_ARG divide start_ARG 150 italic_m start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG start_ARG italic_T italic_p start_POSTSUBSCRIPT + end_POSTSUBSCRIPT end_ARG roman_log ( divide start_ARG 3 italic_T end_ARG start_ARG italic_k end_ARG ) end_ARG
$E_5$: $\left\lvert \mathbb{E}\left[R_i \mid a\right] - \widehat{\mathcal{R}}_{(a,i)} \right\rvert \leq \widehat{\eta}_i \enspace \forall i \in [k], a \in \mathcal{A}_i$. The error in reward estimates in Algorithm 4 is bounded (footnote 3) by $\widehat{\eta}_i$, where $\widehat{\eta}_i = \sqrt{\frac{27\,\widehat{m}_i}{T\,(\widehat{P}^{\top}\widehat{f}^{*})_i}\log\left(2TN\right)}$.

Footnote 3: Recall that $\widehat{f}^{*}$ denotes the optimal frequency vector computed in Step 4 of ConvExplore. Also, $(\widehat{P}^{\top}\widehat{f}^{*})_i$ denotes the $i$th component of the vector $\widehat{P}^{\top}\widehat{f}^{*}$.

Footnote: For the interventions $a \in \mathcal{A}_{m_0}^{c}$, we can estimate $\widehat{P}_{(a,i)} = \mathbb{P}[i \mid a]$ $\forall i \in [k]$ in the first half.

Footnote: Based on these $do()$ interventions at each context $i \in [k]$, we get estimates of $m_i$ and the intervention sets $\mathcal{A}_{m_i}$ such that (I) $|\mathcal{A}_{m_i}| = m_i$ and (II) interventions in $\mathcal{A}_{m_i}$ are observed with probability less than $1/m_i$.

Footnote: Note that we round robin over the interventions $b \in \mathcal{A}_{m_i}$ across visits in the second half of the algorithm.
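As a rough numerical illustration of the radius $\widehat{\eta}_i$ in $E_5$, the sketch below evaluates the formula with NumPy. It assumes (this is not stated in the row itself) that $\widehat{P}$ is stored with one row per intervention in $\mathcal{A}_0$ and one column per context, that `m_hat` holds the values $\widehat{m}_i$, and that $\log$ is the natural logarithm. The function and argument names are hypothetical.

```python
import numpy as np


def eta_hat(P_hat, m_hat, f_hat, T, N):
    """Evaluate eta_hat_i = sqrt(27 * m_hat_i / (T * (P_hat^T f_hat)_i) * log(2 T N)).

    P_hat : (num_interventions, k) array, assumed layout of the estimate of P
    m_hat : (k,) array of the estimates m_hat_i
    f_hat : (num_interventions,) frequency vector (f_hat^* in the text)
    """
    P_hat = np.asarray(P_hat, dtype=float)
    m_hat = np.asarray(m_hat, dtype=float)
    f_hat = np.asarray(f_hat, dtype=float)
    # (P_hat^T f_hat)_i for every context i.
    visit = P_hat.T @ f_hat
    return np.sqrt(27.0 * m_hat / (T * visit) * np.log(2 * T * N))
```

The radius shrinks as $T$ grows and as the context is weighted more heavily by $(\widehat{P}^{\top}\widehat{f}^{*})_i$, which matches the role this quantity plays in the bound.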

Considering the estimates $\widehat{P}$ and $\hat{M}$, along with the frequency vector (footnote 2) $\widehat{f}^{*}$ (computed in Step 4), we define the random variable

$$\widehat{\lambda} := \left\lVert \widehat{P}\hat{M}^{1/2}\left(\widehat{P}^{\top}\widehat{f}^{*}\right)^{\circ -\frac{1}{2}} \right\rVert_{\infty}^{2}.$$

Footnote 2: $\lambda$ is upper bounded by $kn$, but is typically significantly smaller (as $m$ may be much smaller than $n$).

Footnote: If $\mathcal{A}_0 := do() \cup \{X^0_j = 0, X^0_j = 1\}_{j \in [n]}$, we can find $m_0 \leq |\mathcal{A}_0|/2$ such that $\mathcal{A}_0 = \mathcal{A}_{m_0} \cup \mathcal{A}_{m_0}^{c}$, where the interventions in $\mathcal{A}_{m_0}^{c}$ are observed with probability more than $1/m_0$ and $|\mathcal{A}_{m_0}| = m_0$.

Footnote: On each visit to a context $i \in [k]$, we perform $do()$. From these we can estimate the $q_i^j$ values, which may be used to estimate the $m_i$ values.

Footnote: We estimate rewards for the interventions in $\mathcal{A}_{m_i}^{c}$ in the first half, and for the interventions in $\mathcal{A}_{m_i}$ in the second half.

Note that $\widehat{\lambda}$ is a surrogate for $\lambda$. We will show that under the good event, $\widehat{\lambda}$ is close to $\lambda$ (Lemma 3).
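The definition of $\widehat{\lambda}$ can be evaluated directly once the estimates are available. The following is a minimal sketch, assuming that $\hat{M}$ is the diagonal matrix $\mathrm{diag}(\widehat{m}_1, \dots, \widehat{m}_k)$ and that $\widehat{P}$ has one row per intervention in $\mathcal{A}_0$ and one column per context; both are assumptions about layout rather than statements from the text, and the function name is hypothetical.

```python
import numpy as np


def lambda_hat(P_hat, m_hat, f_hat):
    """Evaluate || P_hat M_hat^{1/2} (P_hat^T f_hat)^{o -1/2} ||_inf^2.

    Assumes M_hat = diag(m_hat), so M_hat^{1/2} acts entry-wise on a k-vector.
    """
    P_hat = np.asarray(P_hat, dtype=float)
    m_hat = np.asarray(m_hat, dtype=float)
    f_hat = np.asarray(f_hat, dtype=float)
    visit = P_hat.T @ f_hat                # k-vector P_hat^T f_hat
    weights = np.sqrt(m_hat / visit)       # M_hat^{1/2} (P_hat^T f_hat)^{o -1/2}
    per_intervention = P_hat @ weights     # one entry per intervention
    return float(np.max(np.abs(per_intervention)) ** 2)
```

In ConvExplore the vector $\widehat{f}^{*}$ comes from a convex program (Step 4); the sketch only evaluates the objective at a given frequency vector and does not perform that minimization.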

Recall that ${\rm Regret}_T := \mathbb{E}[\varepsilon(\pi)]$, where the expectation is with respect to the policy $\pi$ computed by the algorithm. We can further consider the expected sub-optimality of the algorithm and the quality of the estimates (in particular, $\widehat{P}$, $\hat{M}$ and $\widehat{\lambda}$) under the good event ($E$).

Based on the estimates returned at Step 4 of ConvExplore, either the good event holds, or we have the bad event. We obtain the regret guarantee by first bounding the sub-optimality of policies computed under the good event, and then bounding the probability of the bad event.

Lemma 1.

For the optimal policy $\pi^{*}$, under the good event ($E$), we have

$$\sum_{i\in[k]} P_{(\pi^{*}(0),i)}\,\mathbb{E}\left[R_i \mid \pi^{*}(i)\right] - \sum_{i\in[k]} \widehat{P}_{(\pi^{*}(0),i)}\,\widehat{\mathcal{R}}_{(\pi^{*}(i),i)} \leq \mathcal{O}\left(\sqrt{\frac{\max\{\widehat{\lambda}, m_0/p_{+}\}}{T}\log\left(NT\right)}\right).$$

Proof.

We add and subtract $\sum_{i\in[k]} P_{(\pi^{*}(0),i)}\widehat{\mathcal{R}}_{(\pi^{*}(i),i)}$ and reduce the expression on the left to:

$$\sum_{i\in[k]} P_{(\pi^{*}(0),i)}\left(\mathbb{E}\left[R_i \mid \pi^{*}(i)\right] - \widehat{\mathcal{R}}_{(\pi^{*}(i),i)}\right) + \sum_{i\in[k]} \widehat{\mathcal{R}}_{(\pi^{*}(i),i)}\left(P_{(\pi^{*}(0),i)} - \widehat{P}_{(\pi^{*}(0),i)}\right).$$

We have: (a) $\widehat{\mathcal{R}}_{(\pi^{*}(i),i)} \leq 1$ (as rewards are bounded), (b) $\sum_{i\in[k]} \lvert \widehat{P}_{(\pi^{*}(0),i)} - P_{(\pi^{*}(0),i)} \rvert \leq \zeta$ (by $E_4$), and (c) $\left\lvert \mathbb{E}\left[R_i \mid \pi^{*}(i)\right] - \widehat{\mathcal{R}}_{(\pi^{*}(i),i)} \right\rvert \leq \widehat{\eta}_i$ (by $E_5$). The above expression is thus bounded above by $\sum_{i\in[k]} P_{(\pi^{*}(0),i)}\widehat{\eta}_i + \zeta$. Furthermore, it follows from $E_1$ (see Corollary 2 in Section D.1 in the supplementary material) that (component-wise) $P \leq \frac{3}{2}\widehat{P}$. Hence, the above-mentioned expression is bounded above by $\frac{3}{2}\sum_{i\in[k]} \widehat{P}_{(\pi^{*}(0),i)}\widehat{\eta}_i + \zeta$. Note that the definition of $\widehat{\lambda}$ ensures $\sum_{i\in[k]} \widehat{P}_{(\pi^{*}(0),i)}\widehat{\eta}_i = \mathcal{O}\left(\sqrt{\frac{\widehat{\lambda}}{T}\log(NT)}\right)$. Further, $\zeta = \mathcal{O}\left(\sqrt{\frac{m_0}{Tp_{+}}\log(T/k)}\right)$. Hence, $\sum_{i\in[k]} P_{(\pi^{*}(0),i)}\widehat{\eta}_i + \zeta = \mathcal{O}\left(\sqrt{\frac{\max\{\widehat{\lambda}, m_0/p_{+}\}}{T}\log\left(NT\right)}\right)$, which establishes the lemma. ∎
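The step "the definition of $\widehat{\lambda}$ ensures..." can be spelled out as follows. This is a sketch under the assumption that $\hat{M}$ is the diagonal matrix $\mathrm{diag}(\widehat{m}_1,\dots,\widehat{m}_k)$ and that $NT \geq 2$; it writes $a = \pi^{*}(0)$ and substitutes the expression for $\widehat{\eta}_i$ from event $E_5$.

```latex
\begin{align*}
\sum_{i\in[k]} \widehat{P}_{(a,i)}\,\widehat{\eta}_i
  &= \sqrt{\frac{27\log(2TN)}{T}}\;
     \sum_{i\in[k]} \widehat{P}_{(a,i)}
     \sqrt{\frac{\widehat{m}_i}{(\widehat{P}^{\top}\widehat{f}^{*})_i}} \\
  &= \sqrt{\frac{27\log(2TN)}{T}}\;
     \Bigl[\widehat{P}\hat{M}^{1/2}
     \bigl(\widehat{P}^{\top}\widehat{f}^{*}\bigr)^{\circ -\frac{1}{2}}\Bigr]_{a} \\
  &\leq \sqrt{\frac{27\log(2TN)}{T}}\;
     \Bigl\lVert \widehat{P}\hat{M}^{1/2}
     \bigl(\widehat{P}^{\top}\widehat{f}^{*}\bigr)^{\circ -\frac{1}{2}}\Bigr\rVert_{\infty}
   = \sqrt{\frac{27\,\widehat{\lambda}\,\log(2TN)}{T}} .
\end{align*}
% Since \log(2TN) \le 2\log(NT) whenever NT \ge 2, the right-hand side is
% O\bigl(\sqrt{(\widehat{\lambda}/T)\log(NT)}\bigr).
```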

We now state a similar lemma for any policy $\widehat{\pi}$ computed under the good event.

Lemma 2.

Let $\widehat{\pi}$ be a policy computed by ConvExplore under the good event ($E$). Then,

$$\sum_{i\in[k]} \widehat{P}_{(\widehat{\pi}(0),i)}\,\widehat{\mathcal{R}}_{(\widehat{\pi}(i),i)} - \sum_{i\in[k]} P_{(\widehat{\pi}(0),i)}\,\mathbb{E}\left[R_i \mid \widehat{\pi}(i)\right] \leq \mathcal{O}\left(\sqrt{\frac{\max\{\widehat{\lambda}, m_0/p_{+}\}}{T}\log\left(NT\right)}\right).$$

Proof.

We can add and subtract $\sum_{i\in[k]} P_{(\widehat{\pi}(0),i)}\widehat{\mathcal{R}}_{(\widehat{\pi}(i),i)}$ to the expression on the left to get:

$$\sum_{i\in[k]} \widehat{\mathcal{R}}_{(\widehat{\pi}(i),i)}\left(\widehat{P}_{(\widehat{\pi}(0),i)} - P_{(\widehat{\pi}(0),i)}\right) + \sum_{i\in[k]} P_{(\widehat{\pi}(0),i)}\left(\widehat{\mathcal{R}}_{(\widehat{\pi}(i),i)} - \mathbb{E}\left[R_i \mid \widehat{\pi}(i)\right]\right).$$

Analogous to Lemma 1, one can show that this expression is bounded above by $\zeta + \sum_{i\in[k]} \frac{3}{2}\widehat{P}_{(\widehat{\pi}(0),i)}\widehat{\eta}_i = \mathcal{O}\left(\sqrt{\frac{\max\{\widehat{\lambda}, m_0/p_{+}\}}{T}\log\left(NT\right)}\right)$. ∎

We can also bound $\widehat{\lambda}$ to within a constant factor of $\lambda$.

Lemma 3.

Under the good event $E$, we have $\widehat{\lambda} \leq 8\lambda$.

Proof.

Event $E_1$ ensures that $\frac{2}{3}P \leq \widehat{P} \leq \frac{4}{3}P$ (see Corollary 2 in Appendix Section D.1). In addition, note that event $E_3$ gives us $\hat{M} \leq 2M$. From these observations we obtain the desired bound:

$$\widehat{\lambda} = \left\lVert \widehat{P}\hat{M}^{0.5}(\widehat{P}^{\top}\widehat{f}^{*})^{\circ -0.5} \right\rVert_{\infty}^{2} \leq \left\lVert \widehat{P}\hat{M}^{0.5}(\widehat{P}^{\top}f^{*})^{\circ -0.5} \right\rVert_{\infty}^{2} \leq 8\left\lVert PM^{0.5}(P^{\top}f^{*})^{\circ -0.5} \right\rVert_{\infty}^{2} = 8\lambda;$$

here, the first inequality follows from the fact that $\widehat{f}^{*}$ is the minimizer of the $\widehat{\lambda}$ expression, and for the second inequality, we substitute the appropriate bounds of $\widehat{P}$ and $\hat{M}$. ∎
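To see where a constant of this size can come from, the substitution in the second inequality can be carried out entry by entry. The following is a sketch, assuming that $M$ and $\hat{M}$ are diagonal with non-negative entries, that all entries of $P$, $\widehat{P}$ and $f^{*}$ are non-negative, and that the norm is the squared $\ell_\infty$ norm from the definition of $\widehat{\lambda}$.

```latex
\begin{align*}
\widehat{P} \le \tfrac{4}{3}P, \qquad
\hat{M}^{0.5} \le \sqrt{2}\,M^{0.5}, \qquad
(\widehat{P}^{\top}f^{*})^{\circ -0.5}
  \le \bigl(\tfrac{2}{3}P^{\top}f^{*}\bigr)^{\circ -0.5}
  = \sqrt{\tfrac{3}{2}}\,(P^{\top}f^{*})^{\circ -0.5}.
\end{align*}
% Multiplying the three entry-wise bounds and squaring the infinity norm:
\begin{align*}
\Bigl\lVert \widehat{P}\hat{M}^{0.5}(\widehat{P}^{\top}f^{*})^{\circ -0.5}\Bigr\rVert_{\infty}^{2}
  \le \Bigl(\tfrac{4}{3}\cdot\sqrt{2}\cdot\sqrt{\tfrac{3}{2}}\Bigr)^{2}
      \Bigl\lVert PM^{0.5}(P^{\top}f^{*})^{\circ -0.5}\Bigr\rVert_{\infty}^{2}
  = \tfrac{16}{3}\,\lambda \le 8\lambda .
\end{align*}
```

Under these assumptions the factor is $16/3$, which is within the stated constant $8$.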

Recall that:

$$\pi^{*}(i) = \operatorname*{arg\,max}_{a\in\mathcal{A}_i}\ \mathbb{E}\left[R_i \mid a\right] \tag{4}$$

$$\pi^{*}(0) = \operatorname*{arg\,max}_{b\in\mathcal{A}_0}\left(\sum_{i=1}^{k} \mathbb{E}\left[R_i \mid \pi^{*}(i)\right]\cdot\mathbb{P}\{i \mid b\}\right) \tag{5}$$

We will now define $\varepsilon(\pi)$, denoting the sub-optimality of a policy $\pi$, as the difference between the expected rewards of $\pi^{*}$ and $\pi$, i.e.,

$$\varepsilon(\pi) = \sum_{i=1}^{k} \mathbb{E}\left[R_i \mid \pi^{*}(i)\right]\cdot\mathbb{P}\{i \mid \pi^{*}(0)\} - \sum_{i=1}^{k} \mathbb{E}\left[R_i \mid \pi(i)\right]\cdot\mathbb{P}\{i \mid \pi(0)\}.$$
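Equations (4) and (5) and the definition of $\varepsilon(\pi)$ can be illustrated on a small tabular instance. The sketch below is hypothetical: it assumes every context has the same number of final actions, so that the expected rewards $\mathbb{E}[R_i \mid a]$ fit in a `k x A` array `R`, and it stores $\mathbb{P}\{i \mid b\}$ as `P[b, i]`. None of these names come from the paper.

```python
import numpy as np


def policy_value(P, R, first, finals):
    """Expected reward sum_i P[first, i] * R[i, finals[i]] of a two-stage policy."""
    k = R.shape[0]
    return float(sum(P[first, i] * R[i, finals[i]] for i in range(k)))


def optimal_policy(P, R):
    """Return (pi*(0), [pi*(1), ..., pi*(k)]) following equations (4) and (5)."""
    finals = R.argmax(axis=1)          # eq. (4): best final action in each context
    best = R.max(axis=1)               # E[R_i | pi*(i)]
    first = int(np.argmax(P @ best))   # eq. (5): best initial intervention
    return first, finals


def suboptimality(P, R, first, finals):
    """epsilon(pi): value of the optimal policy minus value of the given policy."""
    opt_first, opt_finals = optimal_policy(P, R)
    return policy_value(P, R, opt_first, opt_finals) - policy_value(P, R, first, finals)
```

By construction, `suboptimality` is zero for the policy returned by `optimal_policy` and non-negative for any other choice of `first` and `finals`.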

Corollary 1.

For any $\widehat{\pi}$ computed by ConvExplore under the good event $E$, $\varepsilon(\widehat{\pi}) = \mathcal{O}\left(\sqrt{\frac{\max\{\lambda, m_0/p_{+}\}}{T}\log\left(NT\right)}\right)$.

Proof.

Since ConvExplore selects the optimal policy (maximizing rewards with respect to the estimates), $\sum_{i\in[k]} \widehat{P}_{(\pi^{*}(0),i)}\widehat{\mathcal{R}}_{(\pi^{*}(i),i)} \leq \sum_{i\in[k]} \widehat{P}_{(\widehat{\pi}(0),i)}\widehat{\mathcal{R}}_{(\widehat{\pi}(i),i)}$. Combining this with Lemmas 1 and 2, we get

$$\sum_{i\in[k]} P_{(\pi^{*}(0),i)}\,\mathbb{E}\left[R_i \mid \pi^{*}(i)\right] - \sum_{i\in[k]} P_{(\widehat{\pi}(0),i)}\,\mathbb{E}\left[R_i \mid \widehat{\pi}(i)\right] = \mathcal{O}\left(\sqrt{\frac{\max\{\widehat{\lambda}, m_0/p_{+}\}}{T}\log\left(NT\right)}\right)$$

under the good event. The left-hand side of this expression is equal to $\varepsilon(\widehat{\pi})$. Using Lemma 3, we get that $\varepsilon(\widehat{\pi}) = \mathcal{O}\left(\sqrt{\frac{\max\{\lambda, m_0/p_{+}\}}{T}\log\left(NT\right)}\right)$. ∎

Corollary 1 shows that under the good event, the (true) expected rewards of $\pi^{*}$ and $\widehat{\pi}$ are within $\mathcal{O}\left(\sqrt{\frac{\max\{\lambda, m_0/p_{+}\}}{T}\log\left(NT\right)}\right)$ of each other. In Lemma 10 (see Section D.5 in the supplementary material) we will show (footnote 4) that $\mathbb{P}\{\bigcup_{i\in[5]} \neg E_i\} = \mathbb{P}\{F\} \leq 5k/T$ whenever $T \geq T_0$ (footnote 5).

Footnote 4: Recall that, by definition, $F = E^{c}$.

Footnote 5: $T_0$ as defined in Lemma 10 in Section D.5 in the supplementary material.

Footnote: In the second half, we may intervene on the atomic interventions in $\mathcal{A}_{m_0}$ for time $T/(2m_0)$ each.

Footnote: Using observations of $a \in \mathcal{A}_{m_0}$, we estimate $\widehat{P}_{(a,i)} = \mathbb{P}[i \mid a]$ $\forall a \in \mathcal{A}_{m_0}$ and $i \in [k]$.

The above-mentioned bounds together establish Theorem 1 (i.e., bound the regret of ConvExplore):

$${\rm Regret}_T = \mathbb{E}[\varepsilon(\pi)] = \mathbb{E}[\varepsilon(\widehat{\pi}) \mid E]\,\mathbb{P}\{E\} + \mathbb{E}[\varepsilon(\pi') \mid F]\,\mathbb{P}\{F\}.$$

Since the rewards are bounded between $0$ and $1$, we have $\varepsilon(\pi') \leq 1$ for all policies $\pi'$. Moreover, $\mathbb{P}\{E\} \leq 1$, giving us ${\rm Regret}_T \leq \mathbb{E}[\varepsilon(\widehat{\pi}) \mid E] + \mathbb{P}\{F\}$. Therefore, Corollary 1 along with Lemma 10 leads to the guarantee

$${\rm Regret}_T = \mathcal{O}\left(\sqrt{\frac{\max\{\lambda, m_0/p_{+}\}}{T}\log\left(NT\right)}\right) + \frac{5k}{T} = \mathcal{O}\left(\sqrt{\frac{\max\{\lambda, m_0/p_{+}\}}{T}\log\left(NT\right)}\right).$$
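The last equality absorbs the additive $5k/T$ term into the square-root term. One sufficient condition for this, stated here as an illustrative calculation rather than as the condition used in the paper, is obtained by comparing the two terms directly (the shorthand $A$ is introduced only for this sketch):

```latex
% Write A := \max\{\lambda, m_0/p_+\} > 0.
\begin{align*}
\frac{5k}{T} \le \sqrt{\frac{A}{T}\log(NT)}
\;\Longleftrightarrow\;
\frac{25k^{2}}{T^{2}} \le \frac{A\log(NT)}{T}
\;\Longleftrightarrow\;
T \ge \frac{25k^{2}}{A\log(NT)} .
\end{align*}
```

For such $T$ the bad-event contribution is at most the good-event bound, so the sum is of the same order as the square-root term.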

Appendix D Bounding the Probability of the Bad Event

Recall that the good event corresponds to $\bigcap_{i \in [5]} E_i$ (see Definition 1). Write $F := \neg\left(\bigcap_{i \in [5]} E_i\right)$ and note that, for the regret analysis, we require an upper bound on $\mathbb{P}\{F\} = \mathbb{P}\left\{\neg\left(\bigcap_{i \in [5]} E_i\right)\right\} = \mathbb{P}\left\{\bigcup_{i \in [5]} \neg E_i\right\}$. Towards this, in this section we address $\mathbb{P}\{\neg E_i\}$ for each of the events $E_1$ through $E_5$, and then apply the union bound.

D.1 Bound on $\neg E_1$

The next lemma upper bounds the probability of $\neg E_1$.

Lemma 4.

In each of Algorithms 2, 3 and 4 and for all interventions $a \in \mathcal{A}_0$, we have $\mathbb{P}\{\neg E_1\} = \mathbb{P}\left\{\sum_{i=1}^{k} \lvert \widehat{P}_{(a,i)} - P_{(a,i)} \rvert > \frac{p_+}{3}\right\} < \frac{k}{T}$ whenever $T \geq \max\left\{\frac{1620N}{p_+^3}, \frac{2025N}{p_+^2}\log\left(\frac{9NT}{k}\right)\right\}$.

Proof.

On performing any intervention $a \in \mathcal{A}_0$ at context $0$, the intermediate context that we visit follows a multinomial distribution. Hence, we can apply Devroye's inequality (for multinomial distributions) to obtain a concentration guarantee; we state the inequality next in our notation.

Lemma 5 (Restatement of Lemma 3 in Devroye (1983)).

Let $T_a$ be the number of times intervention $a \in \mathcal{A}_0$ is performed in context $0$. Then, for any $\eta > 0$ and any $T_a \geq \frac{20s}{\eta^2}$, we have $\mathbb{P}\left\{\sum_{i=1}^{k} \lvert \widehat{P}_{(a,i)} - P_{(a,i)} \rvert > \eta\right\} \leq 3\exp\left(-\frac{T_a \eta^2}{25}\right)$. Here, $s$ is the size of the support of the distribution (i.e., the number of contexts that can be reached from $a$ with nonzero probability).

Note that each intervention $a \in \mathcal{A}_0$ is performed at least $T_a = \frac{T}{9N}$ times across Algorithms 2, 3 and 4. Setting $\eta = \frac{p_+}{3}$ and $T_a = \frac{T}{9N}$ above, we get that for each intervention $a \in \mathcal{A}_0$, in each subroutine, $\mathbb{P}\left\{\sum_{i=1}^{k} \lvert P_{(a,i)} - \widehat{P}_{(a,i)} \rvert > \frac{p_+}{3}\right\} \leq 3\exp\left(-\frac{T p_+^2}{9N \cdot 9 \cdot 25}\right) = 3\exp\left(-\frac{T p_+^2}{2025N}\right)$.

Note that to apply the inequality, we require $\frac{T}{9N} \geq \frac{180s}{p_+^2}$, i.e., $T \geq \frac{1620sN}{p_+^2}$. In our setting, the support size $s$ is at most $\frac{1}{p_+}$; this follows from the fact that, on performing any intervention $a \in \mathcal{A}_0$, at most $\frac{1}{p_+}$ contexts can have $P_{(a,i)} \geq p_+$. Hence, it suffices to have $T \geq \frac{1620N}{p_+^3}$.

Next, we union bound the probability over the $N$ interventions (at state $0$) and the three subroutines, to obtain that, for any intervention $a \in \mathcal{A}_0$ and in any subroutine, $\mathbb{P}\left\{\sum_{i=1}^{k} \lvert P_{(a,i)} - \widehat{P}_{(a,i)} \rvert > \frac{p_+}{3}\right\} \leq 3N \cdot 3\exp\left(-\frac{T p_+^2}{2025N}\right) = 9N\exp\left(-\frac{T p_+^2}{2025N}\right)$.

Note that $9N\exp\left(-\frac{T p_+^2}{2025N}\right) \leq \frac{k}{T}$ for any $T \geq \frac{2025N}{p_+^2}\log\left(\frac{9NT}{k}\right)$. Hence, for any $T \geq \max\left\{\frac{1620N}{p_+^3}, \frac{2025N}{p_+^2}\log\left(\frac{9NT}{k}\right)\right\}$, we have $\mathbb{P}[\neg E_1] \leq 9N\exp\left(-\frac{T p_+^2}{2025N}\right) \leq \frac{k}{T}$. This completes the proof of the lemma. ∎
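Because $T$ appears on both sides of the horizon condition in Lemma 4, it can help to check the condition numerically. The following is a minimal sketch, not part of the paper's released code: the function names and the doubling search are illustrative assumptions, and the search only returns the first power of two that satisfies the condition, not the smallest admissible horizon.

```python
import math


def devroye_tail(t_a: float, eta: float) -> float:
    """Right-hand side of Devroye's inequality: 3 * exp(-t_a * eta^2 / 25)."""
    return 3.0 * math.exp(-t_a * eta ** 2 / 25.0)


def lemma4_failure_bound(T: float, N: int, p_plus: float) -> float:
    """Union bound over N interventions and three subroutines.

    Uses T_a = T / (9N) and eta = p_plus / 3, as in the proof of Lemma 4.
    """
    return 3 * N * devroye_tail(T / (9 * N), p_plus / 3)


def lemma4_horizon_ok(T: float, N: int, k: int, p_plus: float) -> bool:
    """Check T >= max{1620 N / p_+^3, (2025 N / p_+^2) log(9 N T / k)}."""
    support_cond = 1620 * N / p_plus ** 3
    tail_cond = 2025 * N / p_plus ** 2 * math.log(9 * N * T / k)
    return T >= max(support_cond, tail_cond)


def first_power_of_two_horizon(N: int, k: int, p_plus: float,
                               max_doublings: int = 80) -> int:
    """Hypothetical helper: first power of two satisfying the condition."""
    T = 1
    for _ in range(max_doublings):
        if lemma4_horizon_ok(T, N, k, p_plus):
            return T
        T *= 2
    raise ValueError("no admissible horizon found within max_doublings")
```

For instance, one could call `first_power_of_two_horizon(N=5, k=4, p_plus=0.2)` and then compare `lemma4_failure_bound` at that horizon against `k / T`.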

We state below a corollary which provides a multiplicative bound on $\widehat{P}$ with respect to $P$, complementing the additive form of $E_1$.

Corollary 2.

Under event $E_1$, we have $\frac{2}{3}P_{(a,i)} \leq \widehat{P}_{(a,i)} \leq \frac{4}{3}P_{(a,i)}$, for all interventions $a \in \mathcal{A}_0$ and contexts $i \in [k]$.

Proof.

Event $E_1$ ensures that $\sum_{i=1}^{k} \lvert \widehat{P}_{(a,i)} - P_{(a,i)} \rvert \leq \frac{p_+}{3}$ for each intervention $a \in \mathcal{A}_0$. This, in particular, implies that for each intervention $a \in \mathcal{A}_0$ and context $i \in [k]$ the following inequality holds: $\lvert \widehat{P}_{(a,i)} - P_{(a,i)} \rvert \leq \frac{p_+}{3}$. Note that if $P_{(a,i)} = 0$, then the algorithm will never observe context $i$ with intervention $a$, i.e., in such a case $\widehat{P}_{(a,i)} = P_{(a,i)} = 0$.
For the nonzero $P_{(a,i)}$s, recall that (by definition) $p_+ = \min\{P_{(a,i)} \mid P_{(a,i)} > 0\}$. Therefore, for any nonzero $P_{(a,i)}$, the above-mentioned inequality gives us $\lvert \widehat{P}_{(a,i)} - P_{(a,i)} \rvert \leq \frac{1}{3}P_{(a,i)}$. Equivalently, $\widehat{P}_{(a,i)} \leq \frac{4}{3}P_{(a,i)}$ and $\widehat{P}_{(a,i)} \geq \frac{2}{3}P_{(a,i)}$. Therefore, the corollary holds for all $P_{(a,i)}$s. ∎
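The step from the additive form of $E_1$ to the multiplicative bound of Corollary 2 can be illustrated on a small instance. The sketch below is an illustration only: the transition matrices `P` and `P_hat` are made-up numbers, and the helper names are hypothetical rather than taken from the paper.

```python
def smallest_positive_probability(P):
    """p_+ : the smallest strictly positive entry of the transition matrix."""
    return min(p for row in P for p in row if p > 0)


def event_e1_holds(P, P_hat):
    """Additive form of E_1: every row has L1 error at most p_+ / 3."""
    threshold = smallest_positive_probability(P) / 3
    return all(
        sum(abs(ph - p) for p, ph in zip(row, row_hat)) <= threshold
        for row, row_hat in zip(P, P_hat)
    )


def multiplicative_bound_holds(P, P_hat):
    """Corollary 2: (2/3) P <= P_hat <= (4/3) P, entry by entry."""
    return all(
        (2 / 3) * p <= ph <= (4 / 3) * p
        for row, row_hat in zip(P, P_hat)
        for p, ph in zip(row, row_hat)
    )


# Made-up example with two interventions and four contexts (p_+ = 0.2).
P = [[0.5, 0.3, 0.2, 0.0], [0.25, 0.25, 0.5, 0.0]]
P_hat = [[0.52, 0.29, 0.19, 0.0], [0.24, 0.27, 0.49, 0.0]]
```

Here each row of `P_hat` is within L1 distance `p_+ / 3` of the corresponding row of `P`, so both checks are expected to pass; an estimate with a larger row error would fail `event_e1_holds`.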

D.2 Bound on Events $\neg E_2$ and $\neg E_3$

In this section, we bound the probabilities that our estimated $\widehat{m}_i$s are far away from the true causal parameters $m_i$s.

Lemma 6.

For any $T \geq 144 m_0 \log\left(\frac{TN}{k}\right)$, in Algorithm 2, $\mathbb{P}[\neg E_2] = \mathbb{P}\left\{\widehat{m}_0 \notin [\frac{2}{3}m_0, 2m_0]\right\} \leq \frac{k}{T}$.

Proof.

We allocate time $\frac{T}{3}$ to Algorithm 2. Lemma 8 of Lattimore et al. (2016) ensures that, for any $\delta \in (0,1)$ and $\frac{T}{3} \geq 48 m_0 \log(\frac{N}{\delta})$, we have $\widehat{m}_0 \in [\frac{2}{3}m_0, 2m_0]$, with probability at least $(1-\delta)$. Setting $\delta = \frac{k}{T}$, we get the required probability bound. ∎

Next, we address $\mathbb{P}\{\neg E_3 \mid E_1\}$.
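For concreteness, the substitution behind the horizon requirement of Lemma 6 can be written out as follows. This is only a worked restatement in the lemma's own notation; it assumes $k < T$ so that $\delta = \frac{k}{T}$ lies in $(0,1)$.

```latex
\begin{align*}
\frac{T}{3} \geq 48\, m_0 \log\!\left(\frac{N}{\delta}\right)
\ \text{ with } \ \delta = \frac{k}{T}
\quad &\Longleftrightarrow \quad
\frac{T}{3} \geq 48\, m_0 \log\!\left(\frac{NT}{k}\right) \\
&\Longleftrightarrow \quad
T \geq 144\, m_0 \log\!\left(\frac{TN}{k}\right),
\end{align*}
% and under this condition
% \mathbb{P}\{\widehat{m}_0 \notin [\tfrac{2}{3} m_0,\, 2 m_0]\} \leq \delta = \tfrac{k}{T}.
```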

Lemma 7.

For any $T \geq \frac{648 \max(m_i) N}{p_+}\log(2NT)$, in each of Algorithms 3 and 4, we have $\mathbb{P}\left\{\exists i \in [k],\quad \widehat{m}_i \notin [\frac{2}{3}m_i, 2m_i] \ \big|\ E_1\right\} \leq \frac{k}{T}$.

Proof.

Fix any reachable context $i \in [k]$. Corresponding to such a context, there exists an intervention $\alpha \in \mathcal{A}_0$ such that $P_{(\alpha,i)} \geq p_+$. Event $E_1$ (Corollary 2) implies that $\widehat{P}_{(\alpha,i)} \geq \frac{2}{3}P_{(\alpha,i)} \geq \frac{2}{3}p_+$.

Now, write $T_i$ to denote the number of times context $i \in [k]$ is visited by Algorithms 3 and 4. Recall that in the subroutines we estimate $\widehat{P}_{(\alpha,i)}$ by counting the number of times context $i$ was reached while intervention $\alpha$ was simultaneously observed. Furthermore, note that we allocate to every intervention at least $\frac{T}{9N}$ time (see Step 2 in both subroutines). In particular, intervention $\alpha$ was observed at least $\frac{T}{9N}$ times. Therefore, $\widehat{P}_{(\alpha,i)} \leq \frac{T_i}{\left(\frac{T}{9N}\right)}$. Combined with the bound $\widehat{P}_{(\alpha,i)} \geq \frac{2}{3}p_+$ obtained above, this inequality leads to a useful lower bound: $T_i \geq \frac{T}{9N}\,\widehat{P}_{(\alpha,i)} \geq T\frac{2p_+}{27N}$.

We now restate Lemma 8 from Lattimore et al. (2016): Let $T_i$ be the number of times context $i \in [k]$ is observed. Then, $\mathbb{P}\left\{\widehat{m}_i \notin [\frac{2}{3}m_i, 2m_i]\right\} \leq 2N\exp\left(-\frac{T_i}{48 m_i}\right)$.

Since $T_i \geq \frac{2Tp_+}{27N}$, this guarantee of Lattimore et al. (2016) corresponds to $\mathbb{P}\left\{\widehat{m}_i \notin [\frac{2}{3}m_i, 2m_i]\right\} \leq 2N\exp\left(-\frac{Tp_+}{648 N m_i}\right) \leq 2N\exp\left(-\frac{Tp_+}{648 N \max(m_i)}\right)$.

Union bounding over all contexts $i \in [k]$ and the two Algorithms 3 and 4, we obtain $\mathbb{P}\left\{\exists i \in [k] \text{ in Algorithms 3, 4 with } \widehat{m}_i \notin [\frac{2}{3}m_i, 2m_i]\right\} \leq 2Nk\exp\left(-\frac{Tp_+}{648 N \max(m_i)}\right)$. Finally, substituting the value of $T \geq \frac{648 \max(m_i) N}{p_+}\log(2NT)$ gives us $\mathbb{P}\left\{\exists i \in [k] \text{ in Algorithms 3, 4 with } \widehat{m}_i \notin [\frac{2}{3}m_i, 2m_i]\right\} \leq 2Nk\exp\left(-\frac{p_+}{648 N \max(m_i)} \cdot \left[\frac{648 \max(m_i) N}{p_+}\log(2NT)\right]\right) = \frac{k}{T}$. This completes the proof. ∎
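The final substitution in the proof of Lemma 7 can be sanity-checked numerically: at a horizon where the condition holds with equality, the union bound $2Nk\exp(\cdot)$ collapses to exactly $\frac{k}{T}$. The sketch below is illustrative only. The function names are hypothetical, and the fixed-point iteration used to locate the boundary horizon is a choice made for this example, not something taken from the paper.

```python
import math


def lemma7_rate(N: int, m_max: float, p_plus: float) -> float:
    """Constant c = 648 * max(m_i) * N / p_+ from the horizon condition."""
    return 648.0 * m_max * N / p_plus


def lemma7_failure_bound(T: float, N: int, k: int,
                         m_max: float, p_plus: float) -> float:
    """Union bound 2 N k exp(-T p_+ / (648 N max(m_i)))."""
    return 2.0 * N * k * math.exp(-T / lemma7_rate(N, m_max, p_plus))


def lemma7_boundary_horizon(N: int, m_max: float, p_plus: float,
                            iterations: int = 200) -> float:
    """Approximate the T solving T = c * log(2 N T) by fixed-point iteration.

    Starting from T = c, the iterates increase towards the fixed point,
    where the map T -> c * log(2 N T) is a contraction (slope c / T < 1).
    """
    c = lemma7_rate(N, m_max, p_plus)
    T = c
    for _ in range(iterations):
        T = c * math.log(2.0 * N * T)
    return T
```

Evaluating `lemma7_failure_bound` at the value returned by `lemma7_boundary_horizon` should agree with `k / T` up to floating-point error, and any larger horizon should give a strictly smaller bound relative to `k / T`.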

D.3 Bound on $\neg E_4$

The following lemma provides an upper bound for $\mathbb{P}\{\neg E_4 \mid E_2\}$.

Lemma 8.

Let $\zeta := \sqrt{\frac{150 m_0}{T p_+}\log\left(\frac{3T}{k}\right)}$. Then, $\mathbb{P}\{\neg E_4 \mid E_2\} = \mathbb{P}\left\{\sum_{i \in [k]} \left\lvert P_{(a,i)} - \widehat{P}_{(a,i)} \right\rvert > \zeta \ \big|\ E_2\right\} \leq \frac{k}{T}$.

Proof.

As in the proof of Lemma 4, we will use Devroye's inequality. Write $T_a$ to denote the number of times intervention $a \in \mathcal{A}_0$ is observed (in state $0$) in Algorithm 2. For any $\eta \in (0,1)$ and with $T_a \geq \frac{20s}{\eta^2}$, Devroye's inequality gives us $\mathbb{P}\left\{\sum_{i=1}^{k} \lvert \widehat{P}_{(a,i)} - P_{(a,i)} \rvert > \eta\right\} \leq 3\exp\left(-\frac{T_a \eta^2}{25}\right)$. Here, $s$ is the size of the support of the multinomial distribution.

We first show that $T_a$ is sufficiently large for each intervention $a \in \mathcal{A}_0$. Recall that we allocate time $\frac{T}{3}$ to Algorithm 2. Furthermore, we observe each intervention in state $0$ at least $\frac{T}{3\widehat{m}_0}$ times, either as part of the do-nothing intervention or explicitly in Step 10 of Algorithm 2. Now, event $E_2$ ensures that $\widehat{m}_0 \in [\frac{2}{3}m_0, 2m_0]$. Hence, each intervention $a \in \mathcal{A}_0$ is observed $T_a \geq \frac{T}{3\widehat{m}_0} \geq \frac{T}{3 \cdot 2m_0} = \frac{T}{6m_0}$ times.

Substituting this inequality for $T_a$ in the above-mentioned probability bound, we obtain
$\mathbb{P}\left\{\sum_{i=1}^{k} \lvert \widehat{P}_{(a,i)} - P_{(a,i)} \rvert > \eta\right\} \leq 3\exp\left(-\frac{T\eta^2}{150m_0}\right)$ when $T \geq \frac{120sm_0}{\eta^2}$. As observed in Lemma 4, the support size $s$ is at most $\frac{1}{p_+}$. Therefore, the requirement on $T$ reduces to $T \geq \frac{120m_0}{\eta^2 p_+}$.

Setting $\eta = \sqrt{\frac{150m_0}{Tp_+}\log\left(\frac{3T}{k}\right)}$ gives us

$$
\begin{aligned}
\mathbb{P}\left\{\sum_{i=1}^{k} \lvert \widehat{P}_{(a,i)} - P_{(a,i)} \rvert > \sqrt{\frac{150m_0}{Tp_+}\log\left(\frac{3T}{k}\right)}\right\}
&\leq 3\exp\left(\frac{-T}{150m_0}\left[\sqrt{\frac{150m_0}{Tp_+}\log\left(\frac{3T}{k}\right)}\right]^2\right) \\
&\leq \frac{k}{T}.
\end{aligned}
$$
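
The final inequality can be spelled out as follows. This is a short check under the assumptions $p_+ \in (0,1]$ and $k \leq 3T$, neither of which is restated at this point in the proof:

$$
3\exp\left(-\frac{T}{150m_0}\cdot\frac{150m_0}{Tp_+}\log\left(\frac{3T}{k}\right)\right)
= 3\left(\frac{k}{3T}\right)^{1/p_+}
\leq 3\cdot\frac{k}{3T}
= \frac{k}{T},
$$

where the inequality uses $x^{1/p_+} \leq x$ for $x \in [0,1]$, since $1/p_+ \geq 1$.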

Therefore $\mathbb{P}\left\{\sum_{i=1}^{k} \lvert \widehat{P}_{(a,i)} - P_{(a,i)} \rvert > \eta\right\} \leq \frac{k}{T}$, and this probability bound requires $T \geq \frac{120m_0}{\eta^2 p_+}$. That is, $\eta \geq \sqrt{\frac{120m_0}{Tp_+}}$. This inequality is satisfied by our choice of $\eta$. Hence, the lemma stands proved. ∎
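
The choice of $\eta$ can also be checked numerically. The following is a minimal sketch and not part of the paper's released code: the function names `eta_choice`, `devroye_tail` and `sample_size_ok`, as well as the sample values of `T`, `k`, `m0` and `p_plus`, are assumptions made purely for illustration.

```python
import math

def eta_choice(T, k, m0, p_plus):
    """eta = sqrt(150*m0/(T*p_plus) * log(3T/k)), as set in the proof."""
    return math.sqrt(150.0 * m0 / (T * p_plus) * math.log(3.0 * T / k))

def devroye_tail(T, m0, eta):
    """Tail bound 3*exp(-T*eta^2/(150*m0)), i.e. after using T_a >= T/(6*m0)."""
    return 3.0 * math.exp(-T * eta ** 2 / (150.0 * m0))

def sample_size_ok(T, m0, p_plus, eta):
    """Requirement T >= 120*m0/(eta^2 * p_plus) under which the bound applies."""
    return T >= 120.0 * m0 / (eta ** 2 * p_plus)

# Hypothetical instance values, chosen only for illustration.
T, k, m0, p_plus = 100_000, 4, 3.0, 0.2
eta = eta_choice(T, k, m0, p_plus)
tail = devroye_tail(T, m0, eta)
requirement_met = sample_size_ok(T, m0, p_plus, eta)
```

For these values `eta` lies in $(0,1)$, `tail` is below `k / T`, and the sample-size requirement holds. With this $\eta$ the requirement is equivalent to $\log(3T/k) \geq 120/150$.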

D.4 Bound on $\neg E_5$

The next lemma bounds $\mathbb{P}\{\neg E_5 \mid E_1, E_3\}$.

Lemma 9.

Let $\widehat{\eta}_i = \sqrt{\frac{27\widehat{m}_i}{T(\widehat{P}^\top \widehat{f}^*)_i}\log\left(2TN\right)}$. Then, $\mathbb{P}\{\neg E_5 \mid E_3, E_1\} \leq \frac{k}{T}$. In other words:

$$
\mathbb{P}\left\{\exists\, i \in [k] \text{ and } a \in \mathcal{A}_i \text{ such that } \left\lvert \mathbb{E}\left[R_i \mid a\right] - \widehat{\mathcal{R}}_{(a,i)} \right\rvert > \widehat{\eta}_i \ \Big|\ E_3, E_1\right\} \leq \frac{k}{T}.
$$

Proof.

For intermediate contexts $i \in [k]$, we denote the realizations of the causal parameters $m_i$ and the transition probabilities $P$ in Algorithm 4 as $\widetilde{m}_i$ and $\widetilde{P}$, respectively. The estimates in the previous subroutines are denoted by $\widehat{m}_i$ and $\widehat{P}$.

Event $E_1$ gives us $P_{(a,i)} \in [\frac{3}{4}\widehat{P}_{(a,i)}, \frac{3}{2}\widehat{P}_{(a,i)}]$ and $\widetilde{P}_{(a,i)} \in [\frac{2}{3}P_{(a,i)}, \frac{4}{3}P_{(a,i)}]$. Hence, the estimates across the subroutines are close enough: $\widetilde{P}_{(a,i)} \in [\frac{1}{2}\widehat{P}_{(a,i)}, 2\widehat{P}_{(a,i)}]$. Similarly, event $E_3$ gives us $\widetilde{m}_i \in [\frac{1}{3}\widehat{m}_i, 3\widehat{m}_i]$.

Write $\widetilde{T}_i$ to denote the number of times context $i \in [k]$ was visited in Algorithm 4. For all contexts $i \in [k]$, we first establish a useful lower bound on $\widetilde{T}_i$, under events $E_1$ and $E_3$. The relevant observation here is that the estimate $\widetilde{P}_{(\alpha,i)}$ was computed in Algorithm 4 by counting the number of times context $i$ was visited with intervention $\alpha \in \mathcal{A}_0$ (at state $0$). By construction, in Algorithm 4 each intervention $\alpha \in \mathcal{A}_0$ was performed at least $\frac{\widehat{f}^*_\alpha}{3}\cdot\frac{T}{3}$ times. Furthermore, given that $\widetilde{P}_{(\alpha,i)}$ was computed via the visitation count, we get that context $i$ is visited with intervention $\alpha \in \mathcal{A}_0$ at least $\widetilde{P}_{(\alpha,i)}\frac{T\widehat{f}^*_\alpha}{9}$ times. Therefore, $\widetilde{T}_i \geq \sum_{\alpha \in \mathcal{A}_0} \widetilde{P}_{(\alpha,i)}\frac{T\widehat{f}^*_\alpha}{9} = \frac{T}{9}(\widetilde{P}^\top \widehat{f}^*)_i \geq \frac{T}{18}(\widehat{P}^\top \widehat{f}^*)_i$. Here, the last inequality follows from the above-mentioned proximity between $\widehat{P}$ and $\widetilde{P}$.

Now, note that, at each context $i \in [k]$, Algorithm 4 (by construction) observes every intervention $a \in \mathcal{A}_i$ at least $\frac{\widetilde{T}_i}{\widetilde{m}_i}$ times. Write $\widetilde{T}_{(a,i)}$ to denote the number of times intervention $a \in \mathcal{A}_i$ is observed in this subroutine. Hence,

$$
\widetilde{T}_{(a,i)} \geq \frac{\widetilde{T}_i}{\widetilde{m}_i} \geq \frac{1}{\widetilde{m}_i}\,\frac{T}{18}(\widehat{P}^\top \widehat{f}^*)_i \geq \frac{1}{3\widehat{m}_i}\,\frac{T}{18}(\widehat{P}^\top \widehat{f}^*)_i \tag{6}
$$

For each context $i \in [k]$ and intervention $a \in \mathcal{A}_i$, define the event $\neg E_5(a,i)$ as $\lvert \mathbb{E}\left[R_i \mid a\right] - \widehat{\mathcal{R}}_{(a,i)} \rvert > \widehat{\eta}_i$. Hoeffding's inequality gives us $\mathbb{P}\left\{\neg E_5(a,i) \mid E_1, E_3\right\} \leq 2\exp\left(-2\widetilde{T}_{(a,i)}\widehat{\eta}_i^2\right) \leq 2\exp\left(-T\frac{(\widehat{P}^\top \widehat{f}^*)_i\,\widehat{\eta}_i^2}{27\widehat{m}_i}\right)$. The last inequality is obtained by substituting Equation 6.

Recall that $\widehat{\eta}_i = \sqrt{\frac{27\widehat{m}_i}{T(\widehat{P}^\top \widehat{f}^*)_i}\log\left(2TN\right)}$. Hence, the previous inequality corresponds to $\mathbb{P}\left\{\neg E_5(a,i) \mid E_1, E_3\right\} \leq 2\exp\left(-T\frac{(\widehat{P}^\top \widehat{f}^*)_i}{27\widehat{m}_i}\cdot\left[\sqrt{\frac{27\widehat{m}_i}{T(\widehat{P}^\top \widehat{f}^*)_i}\log\left(2TN\right)}\right]^2\right) = \frac{1}{TN}$.

Note that $\neg E_5 = \bigcup_{i \in [k]}\bigcup_{a \in \mathcal{A}_i} \neg E_5(a,i)$. Taking a union bound over all contexts $i \in [k]$ and interventions $a \in \mathcal{A}_i$, we obtain $\mathbb{P}\{\neg E_5 \mid E_1, E_3\} \leq \frac{kN}{TN} = \frac{k}{T}$. This completes the proof. ∎
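
The arithmetic of this proof can be traced with a small numerical sketch. It is illustrative only: the helper names and the sample values are hypothetical, `w[i]` stands in for $(\widehat{P}^\top \widehat{f}^*)_i$, and the sketch assumes $N$ interventions per context, matching the count $kN$ used in the union bound.

```python
import math

def visit_lower_bound(T, m_hat_i, w_i):
    """Lower bound on T_tilde_(a,i) from Equation 6: T * w_i / (54 * m_hat_i)."""
    return T * w_i / (54.0 * m_hat_i)

def eta_hat(T, N, m_hat_i, w_i):
    """eta_hat_i = sqrt(27 * m_hat_i / (T * w_i) * log(2TN))."""
    return math.sqrt(27.0 * m_hat_i / (T * w_i) * math.log(2.0 * T * N))

def hoeffding_tail(n_samples, eta):
    """Two-sided Hoeffding bound 2*exp(-2*n*eta^2) for rewards in [0, 1]."""
    return 2.0 * math.exp(-2.0 * n_samples * eta ** 2)

# Hypothetical values: k = 3 contexts, N = 5 interventions per context.
T, N, k = 50_000, 5, 3
m_hat = [2.0, 3.0, 1.5]
w = [0.5, 0.3, 0.2]  # assumed entries of P_hat^T f_hat

per_pair = [
    hoeffding_tail(visit_lower_bound(T, m, wi), eta_hat(T, N, m, wi))
    for m, wi in zip(m_hat, w)
]
union_bound = N * sum(per_pair)  # k contexts, N interventions each
```

Each entry of `per_pair` equals $\frac{1}{TN}$ up to floating-point error, regardless of `m_hat` and `w`, and `union_bound` equals $\frac{k}{T}$.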

D.5 Bound on bad event (F):

Write $T_0 := \mathcal{O}\left(\frac{N\max(m_i)}{p_+^3}\log\left(2NT\right)\right) = \widetilde{O}\left(\frac{N\max(m_i)}{p_+^3}\right)$.

Lemma 10.

$\mathbb{P}\{F\} \leq \frac{5k}{T}$ for any $T > T_0$.

Proof.

We summarize the statements of Lemmas 4, 6, 7, 8 and 9 as follows. When $T \geq T_0$, where

$$
T_0 = \max\left\{\frac{1620N}{p_+^3},\ \frac{2025N}{p_+^2}\log\left(\frac{9NT}{k}\right),\ 144m_0\log\left(\frac{Tn}{k}\right),\ \frac{864\max(m_i)N}{p_+}\log\left(2nT\right)\right\} = \mathcal{O}\left(\frac{N\max(m_i)}{p_+^3}\log\left(2NT\right)\right),
$$

we obtain

$$
\mathbb{P}\{F\} = \mathbb{P}\left\{\bigcup_{i \in [5]} \neg E_i\right\} \leq \mathbb{P}\{\neg E_1\} + \mathbb{P}\{\neg E_2\} + \mathbb{P}\{\neg E_3 \mid E_1\} + \mathbb{P}\{\neg E_4 \mid E_2\} + \mathbb{P}\{\neg E_5 \mid E_3, E_1\} \leq \frac{5k}{T}.
$$

∎

Appendix E Nature of the Optimization Problem

Proposition E.1.

Let $\tilde{f} = \operatorname*{arg\,max}\limits_{\text{fq. vector } f}\ \min\limits_{\text{contexts } [k]} \widehat{P}^\top f$. Then, finding $\tilde{f}$ is an LP.

Proof.

We rewrite the above $\max\limits_{\text{fq. vector } f}\ \min\limits_{i \in [k]} (\cdot)$ as a simpler program:

$$
\begin{aligned}
\max_{f} \quad & z \\
\text{subject to} \quad & \widehat{P}^\top_1 f \geq z \\
& \dots \\
& \widehat{P}^\top_N f \geq z \\
& f \cdot \mathds{1} = 1 \\
& f \succeq 0
\end{aligned}
$$

Here $N = \lvert \mathcal{A}_0 \rvert$. The objective and all constraints are linear in $(f, z)$, so the program can be put in the standard form of a linear program, and hence is an LP. ∎
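
A minimal sketch of this epigraph form in Python, using only the standard library. The helper names `maxmin_lp_data` and `maxmin_value` and the 2x2 matrix are hypothetical. The sketch assumes `P_hat[a][i]` stores $\widehat{P}_{(a,i)}$ and writes one inequality per context, which is our reading of the inner minimum over contexts in Proposition E.1.

```python
def maxmin_lp_data(P_hat):
    """Return (c, A_ub, b_ub, A_eq, b_eq) for the variables x = (f_1..f_N, z).

    Maximizing z is written as minimizing -z.
    """
    N, k = len(P_hat), len(P_hat[0])
    c = [0.0] * N + [-1.0]
    # One row per context i: z - sum_a P_hat[a][i] * f_a <= 0.
    A_ub = [[-P_hat[a][i] for a in range(N)] + [1.0] for i in range(k)]
    b_ub = [0.0] * k
    # The frequencies sum to one; z does not enter this constraint.
    A_eq = [[1.0] * N + [0.0]]
    b_eq = [1.0]
    return c, A_ub, b_ub, A_eq, b_eq

def maxmin_value(P_hat, f):
    """Objective min_i (P_hat^T f)_i for a given frequency vector f."""
    N, k = len(P_hat), len(P_hat[0])
    return min(sum(P_hat[a][i] * f[a] for a in range(N)) for i in range(k))

# Hypothetical estimate: 2 interventions (rows), 2 contexts (columns).
P_hat = [[0.7, 0.3],
         [0.2, 0.8]]
c, A_ub, b_ub, A_eq, b_eq = maxmin_lp_data(P_hat)
uniform_value = maxmin_value(P_hat, [0.5, 0.5])
balanced_value = maxmin_value(P_hat, [0.6, 0.4])
```

The returned tuple follows the "minimize $c^\top x$ subject to $A_{ub}x \leq b_{ub}$, $A_{eq}x = b_{eq}$" convention accepted by common LP solvers (for example the `c`, `A_ub`, `b_ub`, `A_eq`, `b_eq` arguments of `scipy.optimize.linprog`), together with the bound $f \succeq 0$. In this toy instance the frequency vector $(0.6, 0.4)$ equalizes the two contexts and improves on the uniform vector.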

Lemma 11.

$\min\limits_{\text{fq. vector } f}\ \max\limits_{\text{interventions } \mathcal{A}_0} \widehat{P}\hat{M}^{\frac{1}{2}}\left[\widehat{P}^\top f\right]^{\circ -\frac{1}{2}}$ is a convex optimization problem.

Proof.

We first write the min-max as a single minimization. Let us use the shorthand $A := \widehat{P}\hat{M}^{\frac{1}{2}}$, and let $\{A_1, \dots, A_N\}$ (where $N := \lvert \mathcal{A}_0 \rvert$) denote the rows of the matrix $A$:

$$
\begin{aligned}
\textbf{OPT}: \min_{f} \quad & z \\
\text{subject to} \quad & A_1 \cdot \left[\widehat{P}^\top f\right]^{\circ -\frac{1}{2}} \leq z \\
& \dots \\
& A_N \cdot \left[\widehat{P}^\top f\right]^{\circ -\frac{1}{2}} \leq z \\
& f \cdot \mathds{1} = 1 \\
& f \succeq 0
\end{aligned}
\tag{7}
$$
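
As a sanity check on OPT, the following sketch evaluates the min-max objective for a given frequency vector and tests the midpoint convexity inequality on one pair of points. It is illustrative only: the function name `opt_objective` and all numbers are hypothetical, and it assumes $\hat{M}$ is the diagonal matrix of the estimates $\widehat{m}_i$, whose definition is not restated in this appendix.

```python
import math

def opt_objective(P_hat, m_hat, f):
    """max over interventions a of sum_i P_hat[a][i]*sqrt(m_hat[i]) / sqrt((P_hat^T f)_i)."""
    N, k = len(P_hat), len(P_hat[0])
    # q[i] = (P_hat^T f)_i, the visitation frequency of context i under f.
    q = [sum(P_hat[a][i] * f[a] for a in range(N)) for i in range(k)]
    return max(
        sum(P_hat[a][i] * math.sqrt(m_hat[i]) / math.sqrt(q[i]) for i in range(k))
        for a in range(N)
    )

# Hypothetical instance: 2 interventions, 2 contexts.
P_hat = [[0.7, 0.3],
         [0.2, 0.8]]
m_hat = [2.0, 3.0]
f1, f2 = [0.9, 0.1], [0.2, 0.8]
mid = [(u + v) / 2.0 for u, v in zip(f1, f2)]

value_mid = opt_objective(P_hat, m_hat, mid)
value_avg = 0.5 * (opt_objective(P_hat, m_hat, f1) + opt_objective(P_hat, m_hat, f2))
```

On this instance `value_mid` does not exceed `value_avg`, as convexity requires. A single pair of points is of course only a spot check, not a proof.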
Proposition E.2.

For any $a \in \mathbb{R}_+$, the function $g(x) := ax^{-\frac{1}{2}}$ is convex in $x$ on $x > 0$.

Proof.

We observe that the second derivative, $g''(x) = \frac{3a}{4}x^{-\frac{5}{2}}$, is nonnegative for all $x > 0$. ∎

Proposition E.3.

The constraint equations of OPT are convex in $f$.

Proof.

Consider the first constraint of the problem. We can simplify its left-hand side to get $\sum_{i \in [k]} \frac{A_{1i}}{\sqrt{\widehat{P}(*,i)^\top f}}$.

Note that the $i$th summand (i.e., $\frac{A_{1i}}{\sqrt{\widehat{P}(*,i)^{\top} f}}$) is of the form $f(x) = c\,(v^{\top}x)^{-\frac{1}{2}}$ for some $c \in \mathbb{R}_{+}$ and $v \in \mathbb{R}^{N}_{+}$. Let $x_1, x_2 \in \mathbb{R}^{N}$ be any two vectors with $v^{\top}x_1 > 0$ and $v^{\top}x_2 > 0$, and let $\lambda \in [0,1]$ be a scalar. We wish to show that $f(\lambda x_1 + (1-\lambda)x_2) \leq \lambda f(x_1) + (1-\lambda) f(x_2)$.

We have $f(\lambda x_1 + (1-\lambda)x_2) = c\left(v^{\top}(\lambda x_1 + (1-\lambda)x_2)\right)^{-\frac{1}{2}} = c\left(\lambda v^{\top}x_1 + (1-\lambda) v^{\top}x_2\right)^{-\frac{1}{2}}$.

But $a x^{-\frac{1}{2}}$ is convex as per Proposition E.2. Therefore $c\left(\lambda v^{\top}x_1 + (1-\lambda) v^{\top}x_2\right)^{-\frac{1}{2}} \leq \lambda c\,(v^{\top}x_1)^{-\frac{1}{2}} + (1-\lambda)\, c\,(v^{\top}x_2)^{-\frac{1}{2}} = \lambda f(x_1) + (1-\lambda) f(x_2)$, as required.

Since each term $\frac{A_{1i}}{\sqrt{\widehat{P}(*,i)^{\top} f}}$ is convex, the sum $\sum_{i\in[k]} \frac{A_{1i}}{\sqrt{\widehat{P}(*,i)^{\top} f}}$ is convex as well. Similarly, all the other constraints are also convex. ∎

Since the constraints are convex in $f$ and the objective is linear, OPT is convex. ∎
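The convexity claim of Proposition E.3 can be sanity-checked numerically. The sketch below is only an illustration, not part of the paper's released code: the matrix `P_HAT`, the coefficient row `A_ROW`, and all function names are hypothetical. It samples random pairs of points on the simplex and measures how far the constraint's left-hand side ever rises above the chord between them.

```python
import math
import random


def constraint_lhs(a_row, p_hat, f):
    """Left-hand side of one OPT constraint: sum_i a_row[i] / sqrt(P_hat[:, i]^T f)."""
    total = 0.0
    for i, a_i in enumerate(a_row):
        inner = sum(p_hat[j][i] * f[j] for j in range(len(f)))
        total += a_i / math.sqrt(inner)
    return total


def random_simplex_point(n, rng):
    """Draw a strictly positive point on the probability simplex."""
    w = [rng.random() + 1e-3 for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]


def max_convexity_violation(a_row, p_hat, trials=2000, seed=0):
    """Largest observed g(mix) - (lam * g(f1) + (1 - lam) * g(f2)) over random segments."""
    rng = random.Random(seed)
    n = len(p_hat)
    worst = -math.inf
    for _ in range(trials):
        f1 = random_simplex_point(n, rng)
        f2 = random_simplex_point(n, rng)
        lam = rng.random()
        mix = [lam * x + (1 - lam) * y for x, y in zip(f1, f2)]
        gap = constraint_lhs(a_row, p_hat, mix) - (
            lam * constraint_lhs(a_row, p_hat, f1)
            + (1 - lam) * constraint_lhs(a_row, p_hat, f2)
        )
        worst = max(worst, gap)
    return worst


# Hypothetical instance: 3 interventions at context 0, 2 intermediate contexts.
P_HAT = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]  # rows: interventions, columns: contexts
A_ROW = [1.5, 0.7]                             # non-negative coefficients
print(max_convexity_violation(A_ROW, P_HAT) <= 1e-9)
```

A convex function never exceeds its chord, so the reported gap should stay at or below zero up to floating-point error. This is a spot check on one made-up instance and does not replace the proof.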

Appendix F Lower Bounds

This section establishes Theorem 2. We will identify a collection of instances for causal bandits with intermediate feedback and show that, for any given algorithm $\mathscr{A}$, there exists an instance in this collection for which $\mathscr{A}$'s regret is $\Omega\left(\sqrt{\frac{\lambda}{T}}\right)$.

First we describe the collection of instances and then provide the proof.

For any integer $k > 1$, consider $n = (k-1)$ causal variables at each context $i \in \{0, 1, \dots, k\}$. The transition matrix $P$ is set to be deterministic. Specifically, for each $i \in [n]$, we have $\mathbb{P}\{i \mid do(X_i^0 = 1)\} = 1$. For all other interventions at context $0$, we transition to context $k$ with probability $1$. Such a transition matrix can be achieved by setting $q^0_i = 0$ for all $i \in [k-1]$. As before, the total number of interventions is $N := 2n + 1 = 2k - 1$.

Now consider a family of $Nk + 1$ instances $\left\{\mathcal{F}_0\right\} \cup \left\{\mathcal{F}_{(a,i)}\right\}_{i\in[k],\, a\in\mathcal{A}_i}$. [Footnote: Note the change in notation. We used the term $\mathcal{F}_{i,j}$ instead of $\mathcal{F}_{(a,i)}$ in the main paper. This has been amended in a later version of the main paper.] Here, $\mathcal{F}_0$ and each $\mathcal{F}_{(a,i)}$ is an instance of a causal bandit with intermediate feedback with the above-mentioned transition probabilities. The instances differ in the rewards at the intermediate contexts. In particular, in instance $\mathcal{F}_0$, we set the reward distributions such that $\mathbb{E}[R_i \mid a] = \frac{1}{2}$ for all contexts $i \in [k]$ and interventions $a \in \mathcal{A}_i$.
For each $i \in [k]$ and $a \in \mathcal{A}_i$, instance $\mathcal{F}_{(a,i)}$ differs from $\mathcal{F}_0$ only at context $i$ and for intervention $a$. Specifically, by construction, we will have $\mathbb{E}[R_i \mid a] = \frac{1}{2} + \beta$, for a parameter $\beta > 0$. The expected rewards under all other interventions will be $1/2$, the same as in $\mathcal{F}_0$.
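The family described above can be enumerated explicitly. The following sketch is an assumption-laden illustration: the intervention labels such as `do(X1=1)`, the key `"F0"`, and the helper names are hypothetical and only record the expected rewards, not full reward distributions.

```python
def build_instance_family(k, beta):
    """Expected-reward tables for F_0 and every F_(a,i) in the lower-bound family."""
    n = k - 1
    # Hypothetical labels for the N = 2n + 1 interventions available at a context.
    interventions = ["do()"] + [
        f"do(X{j}={v})" for j in range(1, n + 1) for v in (0, 1)
    ]
    contexts = list(range(1, k + 1))

    # F_0: every (intervention, context) pair has expected reward 1/2.
    f0 = {(a, i): 0.5 for i in contexts for a in interventions}

    family = {"F0": f0}
    for i in contexts:
        for a in interventions:
            perturbed = dict(f0)
            perturbed[(a, i)] = 0.5 + beta  # the single pair that differs from F_0
            family[(a, i)] = perturbed
    return interventions, family


def next_context(j, value, k):
    """Deterministic transition from context 0: do(X_j = 1) leads to context j
    for j in [k - 1]; every other intervention leads to context k."""
    if value == 1 and 1 <= j <= k - 1:
        return j
    return k


interventions, family = build_instance_family(k=3, beta=0.1)
print(len(interventions), len(family))
```

With $k = 3$ this gives $N = 2k - 1 = 5$ interventions per context and $Nk + 1 = 16$ instances, matching the counts in the text.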

Given any algorithm $\mathscr{A}$, we will consider the execution of $\mathscr{A}$ over all the instances in the family. The execution of algorithm $\mathscr{A}$ over each instance induces a trace, which may include the realized transition probabilities $\widehat{P}$, the realized variable probabilities $\widehat{q}^{\,i}_j$ for $i \in [k]$ and $j \in [n]$ together with the corresponding values $\widehat{m}_i$, and the realized rewards $\widehat{\mathcal{R}}$. Each such realization is a random variable with a corresponding distribution (over many possible runs of the algorithm). We denote the measures corresponding to these random variables under the instances $\mathcal{F}_0$ and $\mathcal{F}_{(a,i)}$ by $\mathcal{P}_0$ and $\mathcal{P}_{(a,i)}$, respectively.

F.1 Proof of Theorem 2

For any algorithm $\mathscr{A}$ and given time budget $T$, we first consider $\mathscr{A}$'s execution over instance $\mathcal{F}_0$. As mentioned previously, $\mathcal{P}_0$ denotes the trace distribution induced by the algorithm for $\mathcal{F}_0$. In particular, write $r_i$ to denote the expected fraction of the $T$ rounds in which context $i$ is visited, $r_i := \mathbb{E}_{\mathcal{P}_0}\left[\text{number of times state $i$ is visited}\right]/T$.

Recall that $m_i := \max\{j \mid q^i_{(j)} < \frac{1}{j}\}$ and $\mathcal{A}_{m_i} := \{do(X^i_{(j)} = 1) \mid q^i_{(j)} < \frac{1}{j}\}$, where the Bernoulli probabilities of the variables at context $i$ are sorted to satisfy $q^i_{(1)} \leq q^i_{(2)} \leq \dots \leq q^i_{(n)}$. Note that these definitions do not depend on the algorithm at hand. The algorithm, however, may choose to perform different interventions a different number of times.
Write $N_{(a,i)}$ to denote the expected (under $\mathcal{P}_0$) number of times intervention $a$ is performed by the algorithm at context $i$. Furthermore, let the random variable $T_{(a,i)}$ denote the number of times intervention $a$ is observed at context $i$. Hence, $\mathbb{E}_{\mathcal{P}_0}[T_{(a,i)}]$ is the expected number of times intervention $a$ is observed. [Footnote: Note that $a$ can be observed while performing the do-nothing intervention. Also, the expected value $N_{(a,i)}$ accounts for the number of times $a$ is explicitly performed and not just observed.]

Using the expected values for algorithm $\mathscr{A}$ and instance $\mathcal{F}_0$, we define a subset of $\mathcal{A}_{m_i}$ as follows: $\mathcal{J}_i := \left\{a \in \mathcal{A}_{m_i} \ :\ N_{(a,i)} \leq \frac{2 T r_i}{m_i}\right\}$. The following proposition shows that the size of $\mathcal{J}_i$ is sufficiently large.

Proposition F.1.

The set $\mathcal{J}_i$ is non-empty. In particular,

$$
m_i / 2 \leq \lvert \mathcal{J}_i \rvert \leq m_i .
$$
Proof.

The upper bound on the size of subset $\mathcal{J}_i$ follows directly from its definition: since $\mathcal{J}_i \subseteq \mathcal{A}_{m_i}$, we have $\lvert \mathcal{J}_i \rvert \leq \lvert \mathcal{A}_{m_i} \rvert = m_i$.

For the lower bound on the size of $\mathcal{J}_i$, note that $T r_i$ is the expected number of times context $i$ is visited by the algorithm. Therefore,

$$
\sum_{a \in \mathcal{A}_{m_i}} N_{(a,i)} \leq T r_i \tag{8}
$$

Furthermore, by definition, for each intervention $b \in \mathcal{A}_{m_i} \setminus \mathcal{J}_i$ we have $N_{(b,i)} \geq \frac{2 T r_i}{m_i}$. Hence, assuming $\lvert \mathcal{A}_{m_i} \setminus \mathcal{J}_i \rvert > \frac{m_i}{2}$ would contradict inequality (8). This observation implies that $\lvert \mathcal{A}_{m_i} \setminus \mathcal{J}_i \rvert \leq \frac{m_i}{2}$ and, hence, $\lvert \mathcal{J}_i \rvert \geq \frac{m_i}{2}$. This completes the proof. ∎
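The counting argument behind Proposition F.1 is a Markov-style bound, and it can be checked empirically. The sketch below is illustrative only: the function names and the random allocation scheme are assumptions, with `visits` playing the role of $T r_i$ and the list `expected_pulls` playing the role of the values $N_{(a,i)}$ for $a \in \mathcal{A}_{m_i}$.

```python
import random


def low_pull_set(expected_pulls, visits):
    """Indices a with N_a <= 2 * visits / m, mirroring the definition of J_i."""
    m = len(expected_pulls)
    threshold = 2 * visits / m
    return [a for a, n_a in enumerate(expected_pulls) if n_a <= threshold]


def check_half_bound(trials=1000, seed=0):
    """Empirically confirm |J_i| >= m_i / 2 whenever the pulls sum to at most T * r_i."""
    rng = random.Random(seed)
    for _ in range(trials):
        m = rng.randint(1, 20)
        visits = rng.uniform(1.0, 1000.0)  # plays the role of T * r_i
        # Skewed, strictly positive weights so that some arms get most of the pulls.
        weights = [rng.random() ** 4 + 1e-12 for _ in range(m)]
        scale = rng.random() * visits / sum(weights)  # total pulls <= visits
        pulls = [scale * w for w in weights]
        if len(low_pull_set(pulls, visits)) < m / 2:
            return False
    return True


print(check_half_bound())
```

If more than half of the interventions exceeded the threshold $2 T r_i / m_i$, their pulls alone would sum to more than $T r_i$, which is why the check cannot fail for any allocation respecting inequality (8).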

Recall that $T_{(a,i)}$ denotes the number of times intervention $a$ is observed at context $i$. The following proposition bounds $\mathbb{E}[T_{(a,i)}]$ for each intervention $a \in \mathcal{J}_i$.

Proposition F.2.

For every intervention $a \in \mathcal{J}_i$,

$$
\mathbb{E}_{\mathcal{P}_0}[T_{(a,i)}] \leq \frac{3 T r_i}{m_i} .
$$
Proof.

Any intervention $a \in \mathcal{J}_i \subseteq \mathcal{A}_{m_i}$ may be observed either when it is explicitly performed by the algorithm or as a random realization (under some other intervention, including do-nothing). Since $a \in \mathcal{A}_{m_i}$, the probability that $a$ is observed as part of some other intervention is at most $\frac{1}{m_i}$. Therefore, the expected number of times that $a$ is observed by the algorithm, without explicitly performing it, is at most $\frac{T r_i}{m_i}$; recall that the expected number of times context $i$ is visited is equal to $T r_i$. [Footnote: Here, we use the fact that the realization of $a$ is independent of the visitation of context $i$.]

For any intervention $a \in \mathcal{J}_i$, by definition, the expected number of times $a$ is performed satisfies $N_{(a,i)} \leq \frac{2 T r_i}{m_i}$. Therefore, the proposition follows:

$$
\mathbb{E}[T_{(a,i)}] \leq \frac{T r_i}{m_i} + N_{(a,i)} \leq \frac{3 T r_i}{m_i} .
$$

We now state two known results for KL divergence.

Bretagnolle-Huber Inequality (Theorem 14.2 in Lattimore & Szepesvári (2020)): Let $\mathcal{P}$ and $\mathcal{P}'$ be any two measures on the same measurable space. Let $E$ be any event in the sample space with complement $E^c$. Then,

$$
\mathbb{P}_{\mathcal{P}}\{E\} + \mathbb{P}_{\mathcal{P}'}\{E^c\} \geq \frac{1}{2} \exp\left(-\mathrm{KL}(\mathcal{P}, \mathcal{P}')\right). \tag{9}
$$

Bound on KL divergence in terms of the number of observations (adaptation of Equation 17 in Lemma B1 from Auer et al. (1995)): Let $\mathcal{P}_0$ and $\mathcal{P}_{(a,i)}$ be two measures whose expected rewards differ, by an amount $\beta$, for exactly the intervention $a$ at context $i$. Then,

$$
\mathrm{KL}(\mathcal{P}_0, \mathcal{P}_{(a,i)}) \leq 6 \beta^2 \ \mathbb{E}_{\mathcal{P}_0}[T_{(a,i)}] \tag{10}
$$

Using this bound on KL divergence and Proposition F.2, we have, for all contexts $i \in [k]$ and interventions $a \in \mathcal{J}_i$:

$$
\mathrm{KL}(\mathcal{P}_0, \mathcal{P}_{(a,i)}) \leq 6 \beta^2 \cdot \frac{3 T r_i}{m_i} = \frac{18\, T r_i \beta^2}{m_i} \tag{11}
$$

Substituting this in the Bretagnolle-Huber Inequality, we obtain, for any event $E$ in the sample space along with all contexts $i \in [k]$ and all interventions $a \in \mathcal{J}_i$:

$$
\mathbb{P}_{\mathcal{P}_{(a,i)}}\{E\} + \mathbb{P}_{\mathcal{P}_0}\{E^c\} \geq \frac{1}{2} \exp\left(-\frac{18\, T r_i \beta^2}{m_i}\right) \tag{12}
$$
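To get a feel for the magnitudes in inequalities (11) and (12), the two right-hand sides can be evaluated directly. This is a minimal sketch with hypothetical function names and made-up parameter values; it only restates the formulas above.

```python
import math


def kl_upper_bound(T, r_i, m_i, beta):
    """Right-hand side of (11): 18 * T * r_i * beta^2 / m_i."""
    return 18 * T * r_i * beta ** 2 / m_i


def bh_lower_bound(T, r_i, m_i, beta):
    """Right-hand side of (12): (1/2) * exp(-18 * T * r_i * beta^2 / m_i)."""
    return 0.5 * math.exp(-kl_upper_bound(T, r_i, m_i, beta))


# Illustrative values only: T rounds, visit fraction r_i, m_i, and reward gap beta.
print(kl_upper_bound(T=1000, r_i=0.25, m_i=5, beta=0.05))
print(bh_lower_bound(T=1000, r_i=0.25, m_i=5, beta=0.05))
```

The bound in (12) never exceeds $\frac{1}{2}$ and decays as $\beta$ grows, which is why the reward gap $\beta$ has to shrink with $T$ for the two instances to remain hard to distinguish.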

We now define events to lower bound the probability that algorithm $\mathscr{A}$ returns a sub-optimal policy. In particular, write $\widehat{\pi}$ to denote the policy returned by algorithm $\mathscr{A}$. Note that $\widehat{\pi}$ is a random variable.

For any $\ell \in [k]$ and any intervention $b$, write $G_1(b, \ell)$ to denote the event that, under the returned policy $\widehat{\pi}$, intervention $b$ is not chosen at context $\ell$, i.e., $G_1(b, \ell) := \left\{\widehat{\pi}(\ell) \neq b\right\}$. Also, let $G_2(\ell)$ denote the event that policy $\widehat{\pi}$ does not induce a transition to $\ell$ from context $0$, i.e., $G_2(\ell) := \left\{\widehat{\pi}(0) \neq \ell\right\}$. Furthermore, write $G(b, \ell) := G_1(b, \ell) \cup G_2(\ell)$. Note that the complement is $G^c(b, \ell) = G^c_1(b, \ell) \cap G^c_2(\ell) = \{\widehat{\pi}(\ell) = b\} \cap \{\widehat{\pi}(0) = \ell\}$.

Considering measure $\mathcal{P}_0$, we note that for each context $\ell \in [k]$ there exists an intervention $\alpha_\ell \in \mathcal{J}_\ell$ with the property that $\mathbb{P}_{\mathcal{P}_0}\left\{G^c_1(\alpha_\ell, \ell)\right\} = \mathbb{P}_{\mathcal{P}_0}\left\{\widehat{\pi}(\ell) = \alpha_\ell\right\} \leq \frac{1}{\lvert \mathcal{J}_\ell \rvert}$. This follows from the fact that $\sum_{a \in \mathcal{J}_\ell} \mathbb{P}_{\mathcal{P}_0}\left\{\widehat{\pi}(\ell) = a\right\} \leq 1$.
Therefore, since $G^c(\alpha_\ell, \ell) \subseteq G^c_1(\alpha_\ell, \ell)$, for each context $\ell \in [k]$ there exists an intervention $\alpha_\ell$ such that $\mathbb{P}_{\mathcal{P}_0}\{G^c(\alpha_\ell, \ell)\} \leq \frac{1}{\lvert \mathcal{J}_\ell \rvert}$.

This bound and Inequality 12 imply that for all contexts $\ell \in [k]$ there exists an intervention $\alpha_\ell$ that satisfies

$$\mathbb{P}_{\mathcal{P}_{(\alpha_\ell,\ell)}}\{G(\alpha_\ell,\ell)\} \geq \frac{1}{2}\exp\left(-18\,\frac{T r_\ell \beta^2}{m_\ell}\right) - \frac{1}{\lvert \mathcal{J}_\ell \rvert} \tag{13}$$

We will set

$$\beta = \min\left\{\frac{1}{3},\ \sqrt{\frac{\sum_{\ell \in [k]} m_\ell}{18T}}\right\} \tag{14}$$

Therefore $\beta$ takes the value either $\sqrt{\frac{\sum_{\ell \in [k]} m_\ell}{18T}}$ or $\frac{1}{3}$. We address these two cases separately.

Case 1: $\beta = \sqrt{\frac{\sum_{\ell \in [k]} m_\ell}{18T}}$.

We wish to substitute this value of $\beta$ into Inequality 13. Towards this, we state a proposition.

Proposition F.3.

There exists a context $s \in [k]$ such that

$$\sqrt{\frac{m_s}{18 T r_s}} \geq \sqrt{\frac{\sum_{\ell \in [k]} m_\ell}{18T}}$$
Proof.

First, we note the following claim, which considers all vectors $\rho = (\rho_1, \dots, \rho_k)$ in the probability simplex $\Delta$.

Claim F.1.

For any given set of positive integers $m_1, m_2, \ldots, m_k$, we have

$$\min_{(\rho_1,\rho_2,\ldots,\rho_k) \in \Delta}\ \left(\max_{\ell \in [k]} \frac{m_\ell}{\rho_\ell}\right) \geq \sum_{\ell \in [k]} m_\ell$$
Proof.

Assume, towards a contradiction, that for some $\rho \in \Delta$ and all $\ell \in [k]$, we have $\frac{m_\ell}{\rho_\ell} < \sum_{j \in [k]} m_j$. Then, $\rho_\ell > \frac{m_\ell}{\sum_{j \in [k]} m_j}$ for all $\ell \in [k]$. Therefore, $\sum_{\ell \in [k]} \rho_\ell > \sum_{\ell \in [k]} \frac{m_\ell}{\sum_{j \in [k]} m_j} = 1$. However, this is a contradiction, as $\sum_{\ell \in [k]} \rho_\ell = 1$. ∎

An immediate consequence of Claim F.1 is that

$$\min_{(r_1,r_2,\ldots,r_k) \in \Delta} \left(\max_{\ell \in [k]} \sqrt{\frac{m_\ell}{18 T r_\ell}}\right) \geq \sqrt{\frac{\sum_{\ell \in [k]} m_\ell}{18T}}.$$

Therefore, irrespective of how the $r_i$ are chosen, there always exists a context $s \in [k]$ such that $\sqrt{\frac{m_s}{18 T r_s}} \geq \sqrt{\frac{\sum_{\ell \in [k]} m_\ell}{18T}}$. ∎
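As a numerical sanity check of Claim F.1, the following minimal sketch samples random points of the simplex and confirms that the largest ratio $m_\ell/\rho_\ell$ never drops below $\sum_{\ell} m_\ell$, while the proportional choice $\rho_\ell = m_\ell / \sum_{j} m_j$ attains it. The values in `m`, the sample count, and the helper names `max_ratio` and `random_simplex_point` are illustrative assumptions and do not come from the paper.

```python
import random


def max_ratio(m, rho):
    """Return max over l of m[l] / rho[l] for a strictly positive simplex point rho."""
    return max(m_l / r_l for m_l, r_l in zip(m, rho))


def random_simplex_point(k, rng):
    """Draw a strictly positive random point of the probability simplex in R^k."""
    # 1 - rng.random() lies in (0, 1], so every weight is positive.
    weights = [1.0 - rng.random() for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


m = [8, 12, 20]  # hypothetical values of m_1, ..., m_k
total_m = sum(m)
rng = random.Random(0)

# Claim F.1: no point of the simplex pushes the maximum ratio below sum(m).
smallest_sampled = min(
    max_ratio(m, random_simplex_point(len(m), rng)) for _ in range(10000)
)

# The proportional allocation makes every ratio equal to sum(m).
rho_star = [m_l / total_m for m_l in m]
attained = max_ratio(m, rho_star)
```

Random sampling only illustrates the inequality; the proof above is what establishes it for every point of the simplex.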

For a context $s \in [k]$ that satisfies the inequality in Proposition F.3, we note that $\frac{m_s}{18 T r_s} \geq \beta^2$, or equivalently $\frac{18 T r_s \beta^2}{m_s} \leq 1$.

Let us now restate Inequality 13 for such a context $s$. There exists a context $s \in [k]$ and an intervention $\alpha_s$ that satisfies

$$\mathbb{P}_{\mathcal{P}_{(\alpha_s,s)}}\{G(\alpha_s,s)\} \geq \frac{1}{2}\exp\left(-18\,\frac{T r_s \beta^2}{m_s}\right) - \frac{1}{\lvert \mathcal{J}_s \rvert} \geq \frac{1}{2e} - \frac{1}{\lvert \mathcal{J}_s \rvert} \tag{15}$$

Note that the last inequality lower bounds the probability of selecting a non-optimal policy when algorithm $\mathscr{A}$ is executed on instance $\mathcal{F}_{(\alpha_s,s)}$. Furthermore, in instance $\mathcal{F}_{(\alpha_s,s)}$, for any non-optimal policy $\widehat{\pi}$ we have $\varepsilon(\widehat{\pi}) = \left(\frac{1}{2} + \beta\right) - \frac{1}{2} = \beta$. Therefore, we can lower bound $\mathscr{A}$'s regret over instance $\mathcal{F}_{(\alpha_s,s)}$ as follows:

\begin{align}
{\rm Regret}_T = \mathbb{E}[\varepsilon(\widehat{\pi})] &= \mathbb{P}_{\mathcal{P}_{(\alpha_s,s)}}\{G(\alpha_s,s)\} \cdot \mathbb{E}[{\rm Regret} \mid G(\alpha_s,s)] + \mathbb{P}_{\mathcal{P}_{(\alpha_s,s)}}\{G^c(\alpha_s,s)\} \cdot \mathbb{E}[{\rm Regret} \mid G^c(\alpha_s,s)] \tag{16} \\
&\geq \left[\frac{1}{2e} - \frac{1}{\lvert \mathcal{J}_s \rvert}\right]\beta + \mathbb{P}_{\mathcal{P}_{(\alpha_s,s)}}\{G^c(\alpha_s,s)\} \cdot 0 \\
&= \left[\frac{1}{2e} - \frac{1}{\lvert \mathcal{J}_s \rvert}\right]\beta \tag{17}
\end{align}

Note that we can construct the instances to ensure that $m_\ell \geq 8$ for all contexts $\ell$ and, hence, $\left(\frac{1}{2e} - \frac{1}{\lvert \mathcal{J}_\ell \rvert}\right) = \Omega(1)$ (see Proposition F.1). Therefore Equation 17 gives us:

$${\rm Regret}_T = \Omega(\beta) = \Omega\left(\sqrt{\frac{\sum_{\ell \in [k]} m_\ell}{T}}\right) \tag{18}$$
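To make the constants concrete, the sketch below evaluates $\beta$ from Equation 14 and the right-hand side of Equation 17 for one hypothetical instance. The values of `m`, `T`, and `num_interventions_s` (standing in for $\lvert \mathcal{J}_s \rvert$) are assumptions chosen only for illustration, as are the function names.

```python
import math


def beta_from_eq14(m, T):
    """beta = min{1/3, sqrt(sum(m) / (18 T))}, as in Equation 14."""
    return min(1.0 / 3.0, math.sqrt(sum(m) / (18.0 * T)))


def regret_lower_bound(m, T, num_interventions_s):
    """Right-hand side of Equation 17: (1/(2e) - 1/|J_s|) * beta."""
    constant = 1.0 / (2.0 * math.e) - 1.0 / num_interventions_s
    return constant * beta_from_eq14(m, T)


m = [8, 12, 20]          # hypothetical m_1, ..., m_k
T = 10000                # hypothetical horizon, large enough to land in Case 1
num_interventions_s = 8  # hypothetical |J_s|

beta = beta_from_eq14(m, T)
bound = regret_lower_bound(m, T, num_interventions_s)

# In Case 1 the bound scales as 1/sqrt(T): quadrupling T halves beta.
beta_at_4T = beta_from_eq14(m, 4 * T)
```

With these numbers the constant $\frac{1}{2e} - \frac{1}{8}$ is small but strictly positive, which is all the $\Omega(1)$ statement requires.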

Case 2: $\beta = \frac{1}{3}$. In this case, $\sqrt{\frac{\sum_{\ell \in [k]} m_\ell}{18T}} > \frac{1}{3}$.

We showed in Proposition F.3 that there exists a context $s \in [k]$ such that $\sqrt{\frac{m_s}{18 T r_s}} \geq \sqrt{\frac{\sum_{\ell \in [k]} m_\ell}{18T}}$. Combining the two statements, there exists a context $s$ such that $\sqrt{\frac{m_s}{18 T r_s}} \geq \frac{1}{3}$, that is, $\frac{18 T r_s}{m_s} \leq 9$. With $\beta = \frac{1}{3}$ this gives $\frac{18 T r_s \beta^2}{m_s} \leq 9\beta^2 = 1$. We now restate Inequality 13 for such a context $s \in [k]$:

$$\mathbb{P}_{\mathcal{P}_{(\alpha_s,s)}}\{G(\alpha_s,s)\} \geq \frac{1}{2}\exp\left(-9\beta^2\right) - \frac{1}{\lvert \mathcal{J}_s \rvert} = \frac{1}{2e} - \frac{1}{\lvert \mathcal{J}_s \rvert}$$

Following the same procedure as in Case 1, we can derive that ${\rm Regret}_T \geq \left[\frac{1}{2e} - \frac{1}{\lvert \mathcal{J}_s \rvert}\right]\beta$. We saw in Case 1 that it is possible to construct instances such that $\left[\frac{1}{2e} - \frac{1}{\lvert \mathcal{J}_s \rvert}\right] = \Omega(1)$. Therefore the following holds for Case 2 as well:

$${\rm Regret}_T = \Omega(\beta) = \Omega\left(\sqrt{\frac{\sum_{\ell \in [k]} m_\ell}{T}}\right) \tag{19}$$

Inequalities 18 and 19 imply that there exists a context $s$ and an intervention $\alpha_s$ such that, under instance $\mathcal{F}_{(\alpha_s,s)}$, algorithm $\mathscr{A}$'s regret satisfies

$${\rm Regret}_T = \Omega\left(\sqrt{\frac{\sum_{\ell \in [k]} m_\ell}{T}}\right) \tag{20}$$

We complete the proof of Theorem 2 by showing that, in the current setting, $\lambda = \sum_{\ell \in [k]} m_\ell$.

Proposition F.4.

For the chosen transition matrix,

$$\lambda := \min_{\text{frequency vector } f}\ \left\lVert P M^{1/2} \left(P^\top f\right)^{\circ -\frac{1}{2}} \right\rVert_\infty^2 = \sum_{\ell \in [k]} m_\ell$$
Proof.

Recall that all the instances, $\mathcal{F}_0$ and the $\mathcal{F}_{(a,i)}$, have the same (deterministic) transition matrix $P$. Also, parameter $\lambda$ is computed via Equation 3.

Consider any frequency vector $f$ over the interventions $\{1, \dots, N\}$. From the chosen transition matrix, we have the following:

$$P = \begin{bmatrix} 1 & 0 & \dots & 0 \\ 0 & 1 & \dots & 0 \\ & & \dots \\ 0 & 0 & \dots & 1 \\ & & \dots \\ 0 & 0 & \dots & 1 \end{bmatrix} \quad PM^{\frac{1}{2}} = \begin{bmatrix} \sqrt{m_1} & 0 & \dots & 0 \\ 0 & \sqrt{m_2} & \dots & 0 \\ & & \dots \\ 0 & 0 & \dots & \sqrt{m_k} \\ & & \dots \\ 0 & 0 & \dots & \sqrt{m_k} \end{bmatrix} \quad P^\top f = \begin{bmatrix} f_1 \\ f_2 \\ \dots \\ f_{k-1} \\ f_k + \ldots + f_N \end{bmatrix}$$

From here, we can compute the following:

$$PM^{1/2}\left(P^\top f\right)^{\circ -\frac{1}{2}} = \left[\sqrt{\frac{m_1}{f_1}}, \dots, \sqrt{\frac{m_{k-1}}{f_{k-1}}}, \sqrt{\frac{m_k}{f_k + \ldots + f_N}}, \dots, \sqrt{\frac{m_k}{f_k + \ldots + f_N}}\right]^\top$$

That is, for all $\ell \in [k-1]$, the $\ell$th component of the vector $PM^{1/2}\left(P^\top f\right)^{\circ -\frac{1}{2}}$ is equal to $\sqrt{\frac{m_\ell}{f_\ell}}$. All the remaining components are $\sqrt{\frac{m_k}{f_k + \ldots + f_N}}$.

Write $\rho_\ell := f_\ell$ for all $\ell \in [k-1]$ and $\rho_k := \sum_{j=k}^{N} f_j$. Since $f$ is a frequency vector, $(\rho_1, \ldots, \rho_k) \in \Delta$. In addition,

$$PM^{1/2}\left(P^\top f\right)^{\circ -\frac{1}{2}} = \left[\sqrt{\frac{m_1}{\rho_1}}, \dots, \sqrt{\frac{m_{k-1}}{\rho_{k-1}}}, \sqrt{\frac{m_k}{\rho_k}}, \dots, \sqrt{\frac{m_k}{\rho_k}}\right]^\top$$

Therefore, by definition, $\lambda = \min_{(\rho_1,\ldots,\rho_k) \in \Delta} \left(\max_{\ell \in [k]} \frac{m_\ell}{\rho_\ell}\right)$. Claim F.1 gives $\lambda \geq \sum_{\ell \in [k]} m_\ell$. For the complementary direction, the choice $\rho_\ell = \frac{m_\ell}{\sum_{j \in [k]} m_j}$ lies in $\Delta$ and makes every ratio $\frac{m_\ell}{\rho_\ell}$ equal to $\sum_{j \in [k]} m_j$, so $\lambda \leq \sum_{\ell \in [k]} m_\ell$. Hence $\lambda = \sum_{\ell \in [k]} m_\ell$, which proves the proposition. ∎
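The computation in this proof can be reproduced numerically. The sketch below builds the transition matrix $P$ displayed above, evaluates $\lVert PM^{1/2}(P^\top f)^{\circ -1/2} \rVert_\infty^2$ with $M = \mathrm{diag}(m_1, \ldots, m_k)$, and compares a proportional frequency vector against a uniform one. The sizes `N` and `k`, the values in `m`, and the function names are illustrative assumptions.

```python
import math


def build_P(N, k):
    """N x k matrix: intervention j < k leads to context j (1-indexed),
    and every intervention j >= k leads to context k."""
    return [[1.0 if c == min(j, k - 1) else 0.0 for c in range(k)] for j in range(N)]


def objective(P, m, f):
    """Squared infinity norm of P M^{1/2} (P^T f)^{o -1/2}, with M = diag(m).

    Requires every context to receive positive total frequency under f.
    """
    N, k = len(P), len(m)
    # P^T f: total frequency routed to each context.
    pt_f = [sum(P[j][c] * f[j] for j in range(N)) for c in range(k)]
    # M^{1/2} applied to the elementwise power (P^T f)^{-1/2}.
    v = [math.sqrt(m[c] / pt_f[c]) for c in range(k)]
    # Left-multiply by P, then take the squared largest absolute entry.
    Pv = [sum(P[j][c] * v[c] for c in range(k)) for j in range(N)]
    return max(abs(x) for x in Pv) ** 2


m = [8, 12, 20]  # hypothetical m_1, ..., m_k
N, k = 5, len(m)
P = build_P(N, k)
total_m = sum(m)

# Proportional frequencies on the first k interventions, zero elsewhere.
f_star = [m[c] / total_m for c in range(k)] + [0.0] * (N - k)
f_uniform = [1.0 / N] * N

value_star = objective(P, m, f_star)
value_uniform = objective(P, m, f_uniform)
```

Under these assumptions the proportional vector attains $\sum_{\ell} m_\ell$, while the uniform vector gives a strictly larger value, in line with the minimization in the definition of $\lambda$.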

Finally, substituting Proposition F.4 into Equation 20, we obtain that there exists an instance $\mathcal{F}_{(\alpha_s,s)}$ for which algorithm $\mathscr{A}$'s regret is lower bounded as follows:

$${\rm Regret}_T = \Omega\left(\sqrt{\frac{\lambda}{T}}\right). \tag{21}$$

This completes the proof of Theorem 2.

F.2 Proof of Inequality (10)

For completeness, we provide a proof of inequality (10).

Lemma 12.

$$\mathrm{KL}(\mathcal{P}_{0},\mathcal{P}_{(a,i)})\leq 6\beta_{i}^{2}\ \mathbb{E}_{\mathcal{P}_{0}}[T_{(a,i)}]$$

Proof of Inequality (10).

This proof is based on Lemma B1 in Auer et al. (1995). We first introduce some notation. Let $\mathbf{R}_{t-1}$ denote the filtration (of rewards and other observations) up to time $t-1$, and let $R_{t}$ denote the reward at time $t$.

$$\mathrm{KL}(\mathcal{P}_{0},\mathcal{P}_{(a,i)})=\mathrm{KL}\left[\mathbb{P}_{\mathcal{P}_{0}}(R_{T},R_{T-1},\dots,R_{1})\mathrel{\|}\mathbb{P}_{\mathcal{P}_{(a,i)}}(R_{T},R_{T-1},\dots,R_{1})\right]$$

We now state (without proof) a useful lemma for bounding the KL divergence between random variables over a number of observations.

Chain Rule for entropy (Theorem 2.5.1 in Cover & Thomas (2006)): Let $X_{1},\dots,X_{T}$ be random variables drawn according to $P_{1},\dots,P_{T}$. Then

$$H(X_{1},X_{2},\dots,X_{T})=\sum_{i=1}^{T}H(X_{i}\mid X_{i-1},X_{i-2},\dots,X_{1})$$

where $H(\cdot)$ is the entropy associated with the random variables.
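As an illustration, the chain rule can be verified numerically on a small example. The sketch below uses two hypothetical joint distributions `p` and `q` over a pair of binary variables (these numbers are assumptions, not taken from the paper). It checks the entropy identity stated above and also the analogous decomposition for relative entropy, which appears to be the form needed when a KL divergence over a sequence of observations is split into per-round terms.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a (possibly multi-dimensional) pmf."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def kl(p, q):
    """KL divergence in bits between two pmfs on the same support."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / q[mask])).sum())

# Hypothetical joint pmfs over (X1, X2): rows index X1, columns index X2.
p = np.array([[0.30, 0.20], [0.10, 0.40]])
q = np.array([[0.25, 0.25], [0.25, 0.25]])
p1, q1 = p.sum(axis=1), q.sum(axis=1)   # marginals of X1

# H(X2 | X1) = sum_x p1(x) * H(X2 | X1 = x)
h_cond = sum(p1[x] * entropy(p[x] / p1[x]) for x in range(2))
# Conditional KL: sum_x p1(x) * KL(p(. | x) || q(. | x))
kl_cond = sum(p1[x] * kl(p[x] / p1[x], q[x] / q1[x]) for x in range(2))

entropy_gap = entropy(p) - (entropy(p1) + h_cond)
kl_gap = kl(p, q) - (kl(p1, q1) + kl_cond)
print(f"entropy chain rule gap: {entropy_gap:.2e}")
print(f"relative entropy chain rule gap: {kl_gap:.2e}")
```

Both gaps should vanish up to floating-point error; the conditional terms are weighted by the marginal of the first distribution, mirroring the expectation under $\mathcal{P}_0$ in the lemma.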

Using the chain rule for entropy

$$\mathrm{KL}(\mathcal{P}_{0},\mathcal{P}_{(a,i)})=\sum_{t=1}^{T}\mathrm{KL}\left[\mathbb{P}_{\mathcal{P}_{0}}(R_{t}\mid\mathbf{R}_{t-1})\mathrel{\|}\mathbb{P}_{\mathcal{P}_{(a,i)}}(R_{t}\mid\mathbf{R}_{t-1})\right]$$

Let $a_{t}$ be the intervention chosen by algorithm $\mathscr{A}$ at time $t$. Then:

$$\mathrm{KL}(\mathcal{P}_{0},\mathcal{P}_{(a,i)})=\sum_{t=1}^{T}\mathbb{P}_{\mathcal{P}_{0}}\{a_{t}\neq a\mid\mathbf{R}_{t-1}\}\,\mathrm{KL}\left(\frac{1}{2}\mathrel{\|}\frac{1}{2}\right)+\mathbb{P}_{\mathcal{P}_{0}}\{a_{t}=a\mid\mathbf{R}_{t-1}\}\,\mathrm{KL}\left(\frac{1}{2}\mathrel{\|}\frac{1}{2}+\beta_{i}\right)$$

Since $\mathrm{KL}\left(\frac{1}{2}\mathrel{\|}\frac{1}{2}\right)=0$, we get:

$$\begin{aligned}
\mathrm{KL}(\mathcal{P}_{0},\mathcal{P}_{(a,i)})&=\sum_{t=1}^{T}\mathbb{P}_{\mathcal{P}_{0}}\{a_{t}=a\mid\mathbf{R}_{t-1}\}\,\mathrm{KL}\left(\frac{1}{2}\mathrel{\|}\frac{1}{2}+\beta_{i}\right)\\
&=\mathrm{KL}\left(\frac{1}{2}\mathrel{\|}\frac{1}{2}+\beta_{i}\right)\sum_{t=1}^{T}\mathbb{P}_{\mathcal{P}_{0}}\{a_{t}=a\mid\mathbf{R}_{t-1}\}\\
&=\mathrm{KL}\left(\frac{1}{2}\mathrel{\|}\frac{1}{2}+\beta_{i}\right)\mathbb{E}_{\mathcal{P}_{0}}[T_{(a,i)}]
\end{aligned}$$

Claim F.2.

$$\mathrm{KL}\left(\frac{1}{2}\mathrel{\|}\frac{1}{2}+\beta_{i}\right)=-\frac{1}{2}\log_{2}(1-4\beta_{i}^{2})\leq 6\beta_{i}^{2}$$

Proof.
$$\begin{aligned}
\mathrm{KL}\left(\frac{1}{2}\mathrel{\|}\frac{1}{2}+\beta_{i}\right)&=\frac{1}{2}\log_{2}\left[\frac{\frac{1}{2}}{\frac{1}{2}+\beta_{i}}\right]+\left(1-\frac{1}{2}\right)\log_{2}\left[\frac{1-\frac{1}{2}}{1-\frac{1}{2}-\beta_{i}}\right]\\
&=\frac{1}{2}\log_{2}\left[\frac{1}{1+2\beta_{i}}\right]+\frac{1}{2}\log_{2}\left[\frac{1}{1-2\beta_{i}}\right]\\
&=\frac{1}{2}\log_{2}\left[\frac{1}{1-4\beta_{i}^{2}}\right]=-\frac{1}{2}\log_{2}\left[1-4\beta_{i}^{2}\right]\\
&=-\frac{1}{2\ln(2)}\ln\left[1-4\beta_{i}^{2}\right]\leq\frac{3}{2}\cdot\frac{4\beta_{i}^{2}}{2\ln(2)}=\frac{3\beta_{i}^{2}}{\ln(2)}<6\beta_{i}^{2}
\end{aligned}$$

where the inequality is obtained from the Taylor series expansion of the logarithm, $-\ln(1-x)=x+\frac{x^{2}}{2}+\frac{x^{3}}{3}+\dots\leq\frac{3}{2}x$ for $0\leq x\leq\frac{1}{2}$, applied with $x=4\beta_{i}^{2}$; this requires $4\beta_{i}^{2}\leq\frac{1}{2}$, that is, $\beta_{i}$ sufficiently small. ∎
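Claim F.2 is easy to probe numerically. The sketch below is an illustrative check, not part of the original proof: it evaluates the Bernoulli KL divergence (in bits) directly, compares it with the closed form $-\frac{1}{2}\log_2(1-4\beta_i^2)$, and tests the bound $6\beta_i^2$ on a grid of values $\beta_i \in (0, 0.35]$. The grid range and the function names are assumptions chosen for the example.

```python
import math

def kl_bernoulli_bits(p, q):
    """KL divergence in bits between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

def closed_form(beta):
    """Closed form -1/2 * log2(1 - 4 beta^2) from Claim F.2."""
    return -0.5 * math.log2(1 - 4 * beta ** 2)

betas = [0.01 * j for j in range(1, 36)]   # hypothetical grid: 0.01, ..., 0.35

# Largest discrepancy between the direct KL computation and the closed form.
max_err = max(abs(kl_bernoulli_bits(0.5, 0.5 + b) - closed_form(b)) for b in betas)
# Whether the quadratic upper bound 6 * beta^2 holds at every grid point.
bound_holds = all(closed_form(b) <= 6 * b ** 2 for b in betas)

print(f"max |direct KL - closed form| on grid: {max_err:.2e}")
print(f"bound 6*beta^2 holds on grid: {bound_holds}")
```

Because the closed form diverges as $\beta_i \to \frac{1}{2}$ while $6\beta_i^2$ stays bounded, the inequality should be read as a statement about small $\beta_i$, which is the regime the grid covers.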

It follows that $\mathrm{KL}(\mathcal{P}_{0},\mathcal{P}_{(a,i)})\leq 6\beta_{i}^{2}\,\mathbb{E}_{\mathcal{P}_{0}}[T_{(a,i)}]$. ∎