Causal Bandits: The Pareto Optimal Frontier of Adaptivity, a Reduction to Linear Bandits, and Limitations around Unknown Marginals

Ziyi Liu ^⋆ Department of Statistical Sciences, University of Toronto and Vector Institute; [email protected]. Idan Attias ^⋆ Department of Computer Science, Ben-Gurion University and Vector Institute; [email protected]. Daniel M. Roy Department of Statistical Sciences, University of Toronto and Vector Institute; [email protected].

Abstract

In this work, we investigate the problem of adapting to the presence or absence of causal structure in multi-armed bandit problems. In addition to the usual reward signal, we assume the learner has access to additional variables, observed in each round after acting. When these variables $d$ -separate the action from the reward, existing work in causal bandits demonstrates that one can achieve strictly better (minimax) rates of regret (Lu et al., 2020). Our goal is to adapt to this favorable “conditionally benign” structure, if it is present in the environment, while simultaneously recovering worst-case minimax regret, if it is not. Notably, the learner has no prior knowledge of whether the favorable structure holds. In this paper, we establish the Pareto optimal frontier of adaptive rates. We prove upper and matching lower bounds on the possible trade-offs in the performance of learning in conditionally benign and arbitrary environments, resolving an open question raised by Bilodeau et al. (2022). Furthermore, we are the first to obtain instance-dependent bounds for causal bandits, by reducing the problem to the linear bandit setting. Finally, we examine the common assumption that the marginal distributions of the post-action contexts are known and show that a nontrivial estimate is necessary for better-than-worst-case minimax rates.

⋆ Equal contribution.

1 Introduction

In real-world decision making, we often want strong worst-case guarantees as well as the ability to adapt to favorable properties of real-world scenarios. Adaptive sequential decision-making offers a framework to design algorithms to achieve these objectives.

In this paper, we explore adaptivity in multi-armed bandit problems. In standard multi-armed bandits, the learner (policy) takes an action, receives a reward, and then this process repeats over a number of rounds. The learner’s regret is the difference between its cumulative reward and the cumulative reward of the single best action in hindsight. Can we work to identify high-reward actions while minimizing regret?

In this work, we assume there is post-action context, i.e., there may be additional information available to the learner after taking an action, beyond the reward signal. In a worst-case analysis, however, the learner can ignore the post-action context and still achieve minimax rates of regret: the worst-case environment will not offer useful information. However, many real-world settings possess the structure of multi-armed bandit problems with post-action context and, in those cases, this additional information is useful towards minimizing regret.

One way that post-action context can be useful is if we can assume causal structure relating the action (i.e., an intervention) to the reward and post-action (post-intervention) context. Several authors have studied models in this vein (Bareinboim et al., 2015; Lattimore et al., 2016). In this work, we build on the framework of Lattimore et al. (2016), wherein the post-action context is assumed to $d$ -separate each intervention from its associated reward.

Under $d$ -separation, the intervention and reward are independent, conditional on the post-intervention context. Bilodeau et al. (2022) formalized this structure in general terms: a bandit environment is conditionally benign whenever the conditional distribution of the reward, given the post-action context, does not depend on the action.

Minimax regret is well understood for both the classical and causal variants of multi-armed bandits. Notably, algorithms tailored to conditionally benign environments can achieve lower rates of regret, scaling with the number of post-action contexts, rather than the potentially much larger set of actions (Lu et al., 2020; Bilodeau et al., 2022).

Exploiting causal structure is not without its pitfalls. Bilodeau et al. (2022) showed that C-UCB, a minimax optimal causal bandit algorithm, suffers linear regret in some non-benign environments. This raised a natural question: Can we achieve strict adaptivity, i.e., obtain minimax rates simultaneously in the class of conditionally benign environments and in the class of all environments, without knowing in advance which class of environments we will face?

Bilodeau et al. proved that strict adaptivity was impossible, but showed some level of adaptivity was possible. They designed a new algorithm, termed HAC-UCB, and proved that it simultaneously achieves minimax optimal rates on the class of benign environments and always achieves (suboptimal, though sublinear) $T^{3/4}$ rates. In light of this result, Bilodeau et al. raised an open problem, asking whether HAC-UCB was, in a sense, Pareto optimal, implying that the slower rate was the price of adaptivity. More generally, we ask:

What is the Pareto optimal frontier of simultaneously achievable rates of regret in the classes of benign and arbitrary environments, and what algorithms achieve these optimal tradeoffs?

In this paper, we address the above question by providing a complete characterization of the Pareto optimal frontier (up to log factors) as well as the algorithms that achieve it. Besides adaptation, we also study the complexity of causal bandit problems from other perspectives. More specifically, we find a novel reduction from causal bandits to linear bandits, which facilitates the first instance-dependent regret bound for causal bandits and enables the application of some linear bandit algorithms to causal bandits. We also investigate dropping the common assumption that we have perfect knowledge of “the marginals”, i.e., the distribution of the post-action context variable, under each action. On the one hand, we show that it is impossible for any algorithm to enjoy improved minimax regret in benign environments without any knowledge of the true marginals. On the other hand, we identify cases where approximate knowledge of the marginal distributions suffices. Our contributions are explained in more detail as follows.

• In Section 3, we establish near-optimal Pareto regret frontiers for the setting of causal bandits, resolving an open problem raised by Bilodeau et al. (2022), see Figure 1. Utilizing a dynamic balancing method introduced by Cutkosky et al. (2021), we derive the upper bound and also prove near-optimal matching lower bounds. Remarkably, we introduce a phenomenon we call the price of adaptivity, to capture the extra regret that one must incur when attempting to adapt to the presence or lack of causal structure. Consequently, we demonstrate that the model selection method introduced by Cutkosky et al. (2021) cannot be generally improved, for any nontrivial general improvement would decrease the price of adaptivity beyond our lower bound.
• In Section 4, we present a novel reduction from causal bandits to linear bandits with conditional sub-Gaussian noise. Utilizing a phased elimination technique (Lattimore et al., 2020), we identify a new dimension measuring the inherent complexity of causal bandits. It allows us to establish the first instance-dependent regret bound and a strictly tighter worst-case regret bound for causal bandits for conditionally benign environments. Additionally, we prove instance-dependent bounds for stochastic linear bandits, which are novel to the best of our knowledge.
• In Section 5, we study the situation where we have limited knowledge of the marginal distributions over post-action contexts. We provide a lower bound indicating that no algorithm can utilize the causal structure to achieve improved minimax rates without such prior knowledge. This partly justifies the common assumption in the causal bandits literature that algorithms are given the marginals. On the other hand, we give a regret upper bound for the phased elimination algorithm with access to approximate marginals. This result shows that partial knowledge of the marginals suffices in some regimes.

Figure 1: The Pareto-optimal frontier of simultaneously achievable rates of regret in (left axis) the class of conditionally benign environments and (bottom axis) the class of all environments. Shaded regions are unobtainable. All rates are determined up to log terms. Among algorithms that achieve minimax rates on conditionally benign environments, the previously best known algorithm (HAC-UCB) is dominated by an instance of Dynamic Balancing, which our results also demonstrate is Pareto optimal.

1.1 Related Work

Causal bandits.

The causal bandit model was introduced by Lattimore et al. (2016), whose objective was to identify the best intervention. This pure exploration problem has been extensively studied since then (Sen et al., 2017; Xiong & Chen, 2022), while other works have focused on regret minimization (Lu et al., 2020; Nair et al., 2021; Bilodeau et al., 2022). Another interesting topic is relaxing the causal assumptions. For example, the assumption of a known causal graph can be relaxed (Lu et al., 2021; Malek et al., 2023). Our work mainly builds on the study by Bilodeau et al. (2022) regarding adapting to the existence of causal structures as well as approximate marginals.

Model selection.

To achieve adaptivity, a natural idea is to apply a model selection algorithm on top of a group of base learners. There is a growing line of work studying such corralling strategies in the bandit setting (Agarwal et al., 2017; Pacchiano et al., 2020a, b; Cutkosky et al., 2020; Arora et al., 2021; Cutkosky et al., 2021). Agarwal et al. (2017) required certain stability conditions on the base learners, which makes their algorithm quite restrictive. In contrast, some recently proposed general-purpose model selection algorithms for stochastic bandit problems (Pacchiano et al., 2020b; Cutkosky et al., 2020, 2021) are better candidates in our setting, since they require only mild assumptions on the base learners.

Pareto optimal frontier.

When there are multiple performance metrics that cannot all be optimized simultaneously, the Pareto optimal frontier becomes the natural objective to pursue. Problems with several competing benchmarks are abundant in the bandit literature (Koolen, 2013; Lattimore, 2015; Marinov & Zimmert, 2021; Zhu & Nowak, 2022).

2 Problem Setup

We consider the problem of stochastic bandits with post-action contexts, as defined by Bilodeau et al. (2022), and follow their notation. Let $\mathcal{A}$ be the finite action space, $\mathcal{Z}$ be the finite context space and $\mathcal{Y}=[0,1]$ be the reward space. For any set $K$ , we use $\mathcal{P}(K)$ to denote the set of all probability distributions supported on $K$ . For any $p\in\mathcal{P}(\mathcal{Z}\times\mathcal{Y})$ , we use $p(Z)$ to denote its marginal distribution over $\mathcal{Z}$ , and use $p(Y|Z)$ to denote its conditional distribution over $\mathcal{Y}$ given the $Z$ -component.

In this bandit problem, a learner interacts with the stochastic environment for $T$ rounds. The role of the environment is instantiated with a family of distributions $\nu=\{\nu_{a}:a\in\mathcal{A}\}\in\mathcal{P}(\mathcal{Z}\times\mathcal{Y})^{\mathcal{A}}$ indexed by actions in $\mathcal{A}$ . For each round $t\in[T]$ , the learner picks an action $A_{t}$ from $\mathcal{A}$ and then receives a context-reward pair $(Z_{t},Y_{t})$ which is independently sampled from $\nu_{A_{t}}\in\mathcal{P}(\mathcal{Z}\times\mathcal{Y})$ .

To model the learner’s strategy, we need to formalize the information available to the learner when choosing an action. Let $H_{t}=(A_{s},Z_{s},Y_{s})_{s\in[t]}$ denote the observed history up to round $t$ , which is a random variable valued in $\mathcal{H}_{t}:=(\mathcal{A}\times\mathcal{Z}\times\mathcal{Y})^{t}$ . A policy $\pi$ of the learner can be modeled as a sequence of measurable maps from the $\mathcal{H}_{t}$ ’s to $\mathcal{A}$ :

$\displaystyle\pi=(\pi_{t})_{t\in[T]}\in\Pi(\mathcal{A},\mathcal{Z},T):=\prod_{t=1}^{T}\{\mathcal{H}_{t-1}\to\mathcal{A}\},$

where $\Pi(\mathcal{A},\mathcal{Z},T)$ is the space of all policies compatible with $(\mathcal{A},\mathcal{Z},T)$ . Then the learner follows this policy by selecting $A_{t}=\pi_{t}(H_{t-1})$ for each round $t$ . Indeed, the distribution of all outcomes over $T$ rounds, i.e. $(A_{t},Z_{t},Y_{t})_{t\in[T]}$ , is determined by the environment $\nu$ and the player’s policy $\pi$ together. We will always highlight the ambient joint distribution by the subscript on probabilistic operators $\mathbb{P}$ and $\mathbb{E}$ , say $\mathbb{E}_{\nu_{a}}$ and $\mathbb{E}_{\nu,\pi}$ . Additionally, we denote the expected reward for action $a$ and the optimal action $a^{*}$ by

$\displaystyle\mu^{\mathcal{A}}(a):=\mathbb{E}_{\nu_{a}}[Y],\quad a^{*}:=\operatorname*{arg\!\max}_{a\in\mathcal{A}}\mu^{\mathcal{A}}(a).$ (1)

The goal of the learner is to choose some policy $\pi$ that maximizes her expected cumulative reward $\mathbb{E}_{\nu,\pi}[\sum_{t=1}^{T}Y_{t}]$ , or equivalently minimizes her expected pseudo-regret

$\displaystyle\mathbb{E}_{\nu,\pi}[\mathrm{Reg}(T)]:=\mathbb{E}_{\nu,\pi}\Bigl[\sum_{t=1}^{T}\Bigl(\max_{a\in\mathcal{A}}\mathbb{E}_{\nu_{a}}[Y]-Y_{t}\Bigr)\Bigr]=T\cdot\mu^{\mathcal{A}}(a^{*})-\mathbb{E}_{\nu,\pi}\Bigl[\sum_{t=1}^{T}\mu^{\mathcal{A}}(A_{t})\Bigr],$

with $\mathrm{Reg}(t):=t\cdot\mu^{\mathcal{A}}(a^{*})-\sum_{s=1}^{t}\mu^{\mathcal{A}}(A_{s})$ , $t\in[T]$ , being the realized regret, which is stochastic.
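The interaction protocol and the realized regret $\mathrm{Reg}(t)$ can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: rewards take finitely many values, each $\nu_{a}$ is stored as a dictionary over $(z,y)$ pairs, and the environment `env`, the helper names, and the uniform policy are hypothetical choices made for illustration.

```python
import random

def sample_round(env, action, rng):
    """Draw a context-reward pair (Z_t, Y_t) from nu_a, stored as {(z, y): prob}."""
    outcomes = list(env[action].keys())
    weights = list(env[action].values())
    return rng.choices(outcomes, weights=weights, k=1)[0]

def mean_reward(env, action):
    """mu^A(a) = E_{nu_a}[Y]."""
    return sum(p * y for (_, y), p in env[action].items())

def realized_regret(env, actions_played):
    """Reg(t) = t * mu^A(a*) - sum_s mu^A(A_s) for the given action sequence."""
    best = max(mean_reward(env, a) for a in env)
    return sum(best - mean_reward(env, a) for a in actions_played)

# Hypothetical environment: two actions, contexts {0, 1}, binary rewards.
env = {
    "a0": {(0, 0.0): 0.4, (0, 1.0): 0.1, (1, 0.0): 0.1, (1, 1.0): 0.4},
    "a1": {(0, 0.0): 0.1, (0, 1.0): 0.1, (1, 0.0): 0.2, (1, 1.0): 0.6},
}

def uniform_policy(history, rng):
    """A policy maps the history H_{t-1} to an action; this one ignores it."""
    return rng.choice(sorted(env))

rng = random.Random(0)
history = []
for t in range(100):
    a = uniform_policy(history, rng)
    z, y = sample_round(env, a, rng)
    history.append((a, z, y))

regret = realized_regret(env, [a for a, _, _ in history])
```

Here `a1` is optimal with mean reward 0.7, so every round spent on `a0` contributes 0.2 to the regret.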

Conditionally benign property and $d$ -separation.

Under certain structures, the post-action context variable $Z$ enables more efficient exploration and hence smaller regret. One special structure that can be exploited for a better regret guarantee in our setting is the conditionally benign property, introduced by Bilodeau et al. (2022).

Definition 2.1.

(Bilodeau et al., 2022, Definition 3.1) An environment $\nu\in\mathcal{P}(\mathcal{Z}\times\mathcal{Y})^{\mathcal{A}}$ is conditionally benign if and only if there exists $p\in\mathcal{P}(\mathcal{Z}\times\mathcal{Y})$ such that for each $a\in\mathcal{A}$ , $\nu_{a}(Z)\ll p(Z)$ and $\nu_{a}(Y|Z)=p(Y|Z)$ $p$ -a.s. We further denote the space of all conditionally benign environments by $\mathcal{P}_{\mathrm{Benign}}(\mathcal{Z}\times\mathcal{Y})^{\mathcal{A}}$ .

The conditionally benign property is quite general in the sense that it is equivalent to or weaker than some well-studied causal assumptions (Bilodeau et al., 2022). In particular, the conditionally benign property is equivalent to the context variable $Z$ being a $d$ -separator when $\mathcal{A}$ is the set of all interventions. To leverage this benign structure, the causal UCB ( $\operatorname{C-UCB}$ ) algorithm recently proposed by Lu et al. (2020) achieves $\tilde{O}(\sqrt{|\mathcal{Z}|T})$ regret, while non-causal algorithms that are unaware of this structure would still incur the possibly worse regret of $\tilde{O}(\sqrt{|\mathcal{A}|T})$ .
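For intuition, Definition 2.1 can be checked directly when both contexts and rewards have finite support. Under that assumption (which is ours, since $\mathcal{Y}=[0,1]$ in general), an environment is conditionally benign exactly when, for every context $z$ , all actions that reach $z$ with positive probability share the same conditional reward distribution. The sketch below uses hypothetical helper names and hypothetical example environments.

```python
def conditional_given_z(dist, z):
    """Conditional law of Y given Z = z under a joint {(z, y): prob}, or None."""
    mass = sum(p for (zz, _), p in dist.items() if zz == z)
    if mass == 0:
        return None  # z is never observed under this action
    return {y: p / mass for (zz, y), p in dist.items() if zz == z and p > 0}

def is_conditionally_benign(env, tol=1e-9):
    """True if all actions agree on the law of Y given Z wherever Z has mass."""
    contexts = {z for dist in env.values() for (z, _) in dist}
    for z in contexts:
        reference = None
        for dist in env.values():
            cond = conditional_given_z(dist, z)
            if cond is None:
                continue
            if reference is None:
                reference = cond
                continue
            support = set(reference) | set(cond)
            if any(abs(reference.get(y, 0.0) - cond.get(y, 0.0)) > tol
                   for y in support):
                return False
    return True

# Hypothetical benign environment: P(Y=1 | Z=0) = 0.2 and P(Y=1 | Z=1) = 0.8
# for both actions; only the marginals over Z differ.
benign_env = {
    "a0": {(0, 0): 0.40, (0, 1): 0.10, (1, 0): 0.10, (1, 1): 0.40},
    "a1": {(0, 0): 0.16, (0, 1): 0.04, (1, 0): 0.16, (1, 1): 0.64},
}
# Hypothetical non-benign environment: a1 changes the reward law at Z=0.
non_benign_env = {
    "a0": {(0, 0): 0.40, (0, 1): 0.10, (1, 0): 0.10, (1, 1): 0.40},
    "a1": {(0, 0): 0.10, (0, 1): 0.10, (1, 0): 0.16, (1, 1): 0.64},
}
```

In the benign example the action only shifts how often each context occurs, which is precisely the structure that causal algorithms exploit.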

2.1 Adaptivity

A natural question is whether we can compete with C-UCB when the environment is conditionally benign while at the same time maintaining the worst-case $\tilde{O}(\sqrt{|\mathcal{A}|T})$ regret guarantee, without prior knowledge of the nature of the environment. Unfortunately, algorithms designed specifically for the benign setting may fail drastically in non-benign settings. For instance, C-UCB provably incurs linear regret in some non-benign environments (Bilodeau et al., 2022). To remedy this, Bilodeau et al. (2022) devised $\operatorname{HAC-UCB}$ by adding a hypothesis test in each round, which is used to switch irreversibly from C-UCB to UCB whenever it detects a deviation from the conditionally benign property. $\operatorname{HAC-UCB}$ is able to recover the $\tilde{O}(\sqrt{|\mathcal{Z}|T})$ regret in benign settings and achieve sublinear $\tilde{O}(T^{3/4})$ regret in the worst case.

Prior to this work, it was not known whether $\operatorname{HAC-UCB}$ is optimal. Indeed, Bilodeau et al. (2022) showed that strict adaptation, i.e., always achieving the worst-case $O(\sqrt{|\mathcal{A}|T})$ regret while still performing as well as $\operatorname{C-UCB}$ when causal structure exists, is impossible. But this does not rule out the possibility of improving the worst-case $\tilde{O}(T^{3/4})$ regret of $\operatorname{HAC-UCB}$ unilaterally. In this paper we show that such an improvement is indeed feasible and thus obtain an algorithm that dominates $\operatorname{HAC-UCB}$ . Further, we show that our regret guarantee cannot be improved, in the sense of Pareto optimality.

Remark 2.2.

Regarding the optimal rate of regret in the presence of causal structure, it is easy to show an $\Omega(\sqrt{|\mathcal{Z}|T})$ regret lower bound, nearly matching existing $\tilde{O}(\sqrt{|\mathcal{Z}|T})$ regret upper bounds. Whether the log factors can be shaved from the upper bound is unknown. However, the lower bound of Bilodeau et al. (2022) still implies that strict adaptation is impossible for general $\mathcal{A}$ and $\mathcal{Z}$ , since when $|\mathcal{A}|/|\mathcal{Z}|$ is, say, $\Omega(T^{1/5})$ , the $\Omega(\sqrt{|\mathcal{A}|T})$ lower bound in benign settings rules out a $\tilde{O}(\sqrt{|\mathcal{Z}|T})$ upper bound.

Generic algorithms.

For rigorous treatment of adaptivity, we adopt the definition of algorithms as maps from Bilodeau et al. (2022). Specifically, an algorithm $\mathfrak{a}$ is any map from problem-specific inputs to the space of compatible policies

$\displaystyle\mathfrak{a}:(\mathcal{A},\mathcal{Z},T,q)\mapsto\mathfrak{a}(\mathcal{A},\mathcal{Z},T,q)\in\Pi(\mathcal{A},\mathcal{Z},T),$

where $q\in\mathcal{P}(\mathcal{Z})^{\mathcal{A}}$ is the family of marginal distributions given to the algorithm as prior knowledge. When talking about algorithm-induced policies, by default we mean $\mathfrak{a}(\mathcal{A},\mathcal{Z},T,\nu(Z))$ unless stated otherwise, following the common assumption in the causal bandits literature. We will also deal with the case of imperfect prior knowledge in Section 5, where $q$ may not be the exact $\nu(Z)$ . For notational simplicity, we will use $\mathfrak{a}$ to denote its induced policy $\mathfrak{a}(\mathcal{A},\mathcal{Z},T,q)$ when the problem-specific inputs are clear from context. For example, $\mathbb{E}_{\nu,\mathfrak{a}}$ is the same as $\mathbb{E}_{\nu,\mathfrak{a}(\mathcal{A},\mathcal{Z},T,q)}$ .

3 The Pareto Regret Frontier

To formalize our notion of Pareto regret frontier, we need the following definition:

Definition 3.1.

A pair of rate functions $(R_{1}(T;\mathcal{A},\mathcal{Z}),R_{2}(T;\mathcal{A},\mathcal{Z}))$ is said to be realizable if there is an algorithm $\mathfrak{a}$ such that for all $\mathcal{A},\mathcal{Z}$ and $T$ ,

	$\displaystyle\sup_{\nu\in\mathcal{P}_{\mathrm{Benign}}(\mathcal{Z}\times\mathcal{Y})^{\mathcal{A}}}\mathbb{E}_{\nu,\mathfrak{a}}[\mathrm{Reg}(T)]\leq R_{1}(T;\mathcal{A},\mathcal{Z}),$
	$\displaystyle\sup_{\nu\in\mathcal{P}(\mathcal{Z}\times\mathcal{Y})^{\mathcal{A}}}\mathbb{E}_{\nu,\mathfrak{a}}[\mathrm{Reg}(T)]\leq R_{2}(T;\mathcal{A},\mathcal{Z}).$

A pair $(R_{1}(T;\mathcal{A},\mathcal{Z}),R_{2}(T;\mathcal{A},\mathcal{Z}))$ is reasonable if $R_{1}(T;\mathcal{A},\mathcal{Z})\geq\sqrt{|\mathcal{Z}|T}$ and $R_{2}(T;\mathcal{A},\mathcal{Z})\geq\sqrt{|\mathcal{A}|T}$ .

In the following, we elide the dependence of the rates $R_{i}$ on $\mathcal{A}$ and $\mathcal{Z}$ for clarity. We can now describe the Pareto regret frontier, i.e., the set of optimal realizable pairs of rates.

Theorem 3.2.

There exist universal constants $C,c,c^{\prime}>0$ such that

1. Upper bound: If $(R_{1}(T),R_{2}(T))$ is reasonable and $R_{1}(T)R_{2}(T)\geq|\mathcal{A}|T$ , then $(CR_{1}(T)\log T,CR_{2}(T)\log T)$ is realizable;

2. Lower bound: For all realizable $(R_{1}(T),R_{2}(T))$ , we have $R_{2}(T)>c^{\prime}T$ or $R_{1}(T)R_{2}(T)\geq c|\mathcal{A}|T$ .

Both upper and lower bounds will be extensively discussed in the following sections.
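To make the frontier concrete, the following sketch evaluates the two conditions of Theorem 3.2 for numeric rates, ignoring the universal constants and log factors. The function names and the problem sizes are hypothetical; the formulas are the "reasonable" condition from Definition 3.1 and the product condition $R_{1}(T)R_{2}(T)\geq|\mathcal{A}|T$ .

```python
import math

def is_reasonable(r1, r2, n_actions, n_contexts, horizon):
    """Definition 3.1: R1 >= sqrt(|Z| T) and R2 >= sqrt(|A| T)."""
    return (r1 >= math.sqrt(n_contexts * horizon)
            and r2 >= math.sqrt(n_actions * horizon))

def frontier_partner(r1, n_actions, horizon):
    """Smallest R2 with R1 * R2 >= |A| T, i.e. the frontier point paired with R1."""
    return n_actions * horizon / r1

# Hypothetical sizes: |A| = 100 actions, |Z| = 4 contexts, T = 10_000 rounds.
n_actions, n_contexts, horizon = 100, 4, 10_000

# Insisting on the benign minimax rate sqrt(|Z| T) = 200 ...
r1 = math.sqrt(n_contexts * horizon)
# ... forces worst-case regret of order |A| T / R1 = 5000, which equals
# sqrt(|A| / |Z|) * sqrt(|A| T) = 5 * 1000.
r2 = frontier_partner(r1, n_actions, horizon)

# Conversely, insisting on the worst-case minimax rate sqrt(|A| T) = 1000
# leaves a benign rate of 1000 as well, so nothing is gained from structure.
r1_at_minimax = frontier_partner(math.sqrt(n_actions * horizon), n_actions, horizon)
```

The two endpoints show the trade-off: every gain in the benign rate is paid for, multiplicatively, in the worst-case rate.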

3.1 Upper Bounds

In this section, we show that our upper bound can be obtained by applying the algorithmic principle of dynamic balancing ( $\mathrm{DB}$ ) of Cutkosky et al. (2021) to the stochastic bandit problem with post-action contexts. This method is motivated by the fact that, under mild assumptions, it can always achieve $\tilde{O}(\sqrt{T})$ regret when it is run on top of a collection of base learners with $\tilde{O}(\sqrt{T})$ regret. Hence the dependence on $T$ in the $\tilde{O}(T^{3/4})$ regret of $\operatorname{HAC-UCB}$ (Bilodeau et al., 2022) is easily improved. The use of dynamic balancing in our bandit setting is justified by the fact that dynamic balancing does not rely on what kind of (stochastic) contextual information can be observed in the underlying bandit problem. See Appendix A for a detailed explanation.

Algorithm 1 Dynamic balancing ( $\mathrm{DB}$ ) with two base learners

Input: Two base learners, $\{\mathfrak{a}_{i}\}_{i=1,2}$ , factor $d_{i}(\cdot)$ of candidate regret bound, reward bias $b_{i}(\cdot)$ and scaling coefficient $v_{i}$ (hyper-parameters) for each base learner $i\in\{1,2\}$ , and confidence level $\delta\in(0,1)$ .

1. Set $U_{i}(0)=n_{i}(0)=0$ for all $i\in\{1,2\}$ and let the set of active learners be $\mathcal{I}_{1}=\{1,2\}$

2. For $t=1,2,\dots,T$ do

  (a) Select learner from the active set: $i_{t}\in\operatorname*{arg\!\min}_{i\in\mathcal{I}_{t}}v_{i}d_{i}(\delta)\sqrt{n_{i}(t-1)}$

  (b) Play action $A_{t}$ of learner $\mathfrak{a}_{i_{t}}$ and receive reward $Y_{t}$ and context $Z_{t}$

  (c) Update learner $\mathfrak{a}_{i_{t}}$ with $Z_{t}$ and $Y_{t}$

  (d) Update $n_{i}(\cdot)$ and $U_{i}(\cdot)$ :
  $U_{i}(t)\leftarrow U_{i}(t-1)+Y_{t}\mathbb{I}\left\{i=i_{t}\right\}$
  $n_{i}(t)\leftarrow n_{i}(t-1)+\mathbb{I}\left\{i=i_{t}\right\}$

  (e) Compute adjusted average reward $\eta_{i}(t)$ and confidence band $\gamma_{i}(t)$ for all $i\in\{1,2\}$ :
  $\eta_{i}(t)\leftarrow\frac{U_{i}(t)}{n_{i}(t)}-b_{i}(t)$
  $\gamma_{i}(t)\leftarrow 3\sqrt{\frac{\log(2\log n_{i}(t)/\delta)}{n_{i}(t)}}$

  (f) Update the set of active learners:
  $\mathcal{I}_{t+1}\leftarrow\Bigl\{i\in\{1,2\}:\eta_{i}(t)+\gamma_{i}(t)+\frac{d_{i}(\delta)}{\sqrt{n_{i}(t)}}\geq\max_{j=1,2}\eta_{j}(t)+\gamma_{j}(t)\Bigr\}$

Note that the dynamic balancing algorithm (Algorithm 1) takes as input a user-specified candidate regret bound for each base learner $i$ (which takes the form $d_{i}\sqrt{t}$ in our setting). In each round, $\mathrm{DB}$ merely picks the base learner with minimal candidate regret bound, and performs a test to identify and deactivate the learners that seem to violate their candidate regret bounds. As long as there is one base learner whose candidate regret bound is valid, $\mathrm{DB}$ is able to compete with the best of such base learners. A more comprehensive exposition of the idea behind dynamic balancing can be found in Cutkosky et al. (2021).
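The loop of Algorithm 1 can be sketched in a few lines. This is an illustrative transcription rather than a reference implementation: the activity test in step (f) is copied as printed in the listing, the test is skipped until both learners have been played twice so that $\log(2\log n_{i}(t)/\delta)$ is defined (our assumption), and the `UCBLearner` class, the `pull` callback, and all numeric hyper-parameters are hypothetical stand-ins for the base learners and environment.

```python
import math
import random

class UCBLearner:
    """Plain UCB over a list of arms; a stand-in base learner that ignores z."""
    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {a: 0 for a in self.arms}
        self.sums = {a: 0.0 for a in self.arms}
        self.t = 0

    def act(self):
        for a in self.arms:
            if self.counts[a] == 0:
                return a  # play every arm once first
        return max(self.arms, key=lambda a: self.sums[a] / self.counts[a]
                   + math.sqrt(2 * math.log(self.t + 1) / self.counts[a]))

    def update(self, action, z, y):
        self.t += 1
        self.counts[action] += 1
        self.sums[action] += y

def dynamic_balancing(learners, d, v, bias, delta, horizon, pull):
    """Two-learner DB loop; pull(action) returns a (context, reward) pair."""
    n, U = [0, 0], [0.0, 0.0]
    active, picks = [0, 1], []
    for t in range(1, horizon + 1):
        # (a) active learner with the smallest v_i * d_i * sqrt(n_i(t-1))
        i = min(active, key=lambda k: v[k] * d[k] * math.sqrt(n[k]))
        # (b), (c) play its action and feed the observation back to it
        a = learners[i].act()
        z, y = pull(a)
        learners[i].update(a, z, y)
        # (d) bookkeeping for the selected learner
        U[i] += y
        n[i] += 1
        picks.append(i)
        # Assumption: keep both learners active until each n_i >= 2.
        if min(n) < 2:
            active = [0, 1]
            continue
        # (e) adjusted average reward and confidence band
        eta = [U[k] / n[k] - bias[k](t) for k in (0, 1)]
        gamma = [3 * math.sqrt(math.log(2 * math.log(n[k]) / delta) / n[k])
                 for k in (0, 1)]
        # (f) activity test, transcribed as printed in the listing
        threshold = max(eta[k] + gamma[k] for k in (0, 1))
        active = [k for k in (0, 1)
                  if eta[k] + gamma[k] + d[k] / math.sqrt(n[k]) >= threshold]
    return picks, n

# Hypothetical three-armed Bernoulli environment with an uninformative context.
rng = random.Random(0)
means = {"a0": 0.3, "a1": 0.5, "a2": 0.7}

def pull(action):
    z = rng.randrange(2)
    y = 1.0 if rng.random() < means[action] else 0.0
    return z, y

picks, n = dynamic_balancing(
    [UCBLearner(means), UCBLearner(means)], d=[2.0, 5.0], v=[1.0, 1.0],
    bias=[lambda t: 0.0, lambda t: 0.0], delta=0.01, horizon=200, pull=pull)
```

In the setting of this section, the first learner would be a causal algorithm such as $\operatorname{C-UCB}$ and the second would be $\operatorname{UCB}$ ; Cutkosky et al. (2021) remains the authoritative source for the exact test and constants.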

Naturally, we need one base learner that is favorable in benign instances and another base learner that remains robust in non-benign instances. For example, we can pick $\operatorname{C-UCB}$ and $\operatorname{UCB}$ , but any other algorithms with similar regret bounds can be used as well. Formally, we characterize base learners that enjoy a given regret bound in a given type of environment by the following definition:

Definition 3.3.

Let $d:(0,1)\to\mathbb{R}_{>0}$ . A family of learners $\mathfrak{a}=(\mathfrak{a}_{\delta})_{\delta\in(0,1)}$ is a $d$ -benign family if, for all $\delta\in(0,1)$ , for all benign instances, with probability at least $1-O(\delta)$ , for all $t\in[T]$ , $\mathfrak{a}_{\delta}$ has regret no larger than $d(\delta)\sqrt{t}$ . Similarly, a family of learners $\mathfrak{a}=(\mathfrak{a}_{\delta})_{\delta\in(0,1)}$ is a $d$ -arbitrary family if, for all $\delta\in(0,1)$ , for all instances, with probability at least $1-O(\delta)$ , for all $t\in[T]$ , $\mathfrak{a}_{\delta}$ has regret no larger than $d(\delta)\sqrt{t}$ .

Let $\operatorname{C-UCB}=(\operatorname{C-UCB}(\delta))_{\delta\in(0,1)}$ and $\operatorname{UCB}=(\operatorname{UCB}(\delta))_{\delta\in(0,1)}$ be the families of instances of the $\operatorname{C-UCB}$ and $\operatorname{UCB}$ algorithms, respectively, whose confidence band is scaled by $\Theta(\sqrt{\log(1/\delta)})$ . See Appendix C for details.

Proposition 3.4.

$\operatorname{C-UCB}$ is a $d$ -benign family for $d(\delta)=O((\sqrt{|\mathcal{Z}|}+\sqrt{\log(T/\delta)})\sqrt{\log(|\mathcal{Z}|T/\delta)})$ and $\operatorname{UCB}$ is a $d^{\prime}$ -arbitrary family for $d^{\prime}(\delta)=O(\sqrt{|\mathcal{A}|\log(|\mathcal{A}|T/\delta)})$ .

Note that the above result for $\operatorname{UCB}$ is folklore, but the result for $\operatorname{C-UCB}$ is new. The following result describes the adaptive regret of dynamic balancing acting on a benign family and an arbitrary family, which validates the upper bound in Theorem 3.2. What is more impressive is that to realize every point on the Pareto regret frontier (up to log factors), we need only tune the hyper-parameters in $\mathrm{DB}$ accordingly. We elide the dependence of rates $R_{i}$ on $\mathcal{A}$ and $\mathcal{Z}$ below for clarity. See Section A.2 for the proof.

Theorem 3.5.

Let $\mathfrak{a}_{1}$ be a $d_{1}$ -benign family and let $\mathfrak{a}_{2}$ be a $d_{2}$ -arbitrary family of learners, where $d_{1}(\delta)=O\left((\sqrt{|\mathcal{Z}|}+\sqrt{\log(T/\delta)})\sqrt{\log(|\mathcal{Z}|T/\delta)}\right)$ , $d_{2}(\delta)=O(\sqrt{|\mathcal{A}|\log(|\mathcal{A}|T/\delta)})$ . For every pair of reasonable rate functions $R_{1}(T),R_{2}(T)$ such that $R_{1}(T)R_{2}(T)\geq|\mathcal{A}|T$ , there exist hyper-parameters $b_{i}(\cdot),v_{i}$ , $i=1,2$ , such that, for all instances $\nu$ , the policy $\mathrm{DB}(\delta)$ , for $\delta=1/T$ , given by Algorithm 1 with $\mathfrak{a}_{1},\mathfrak{a}_{2}$ and $d_{1},d_{2}$ , satisfies

	$\displaystyle\mathbb{E}_{\nu,\mathrm{DB}(\delta)}[\mathrm{Reg}(T)]=\tilde{O}(R_{1}(T))\quad\text{if }\nu\text{ is conditionally benign,}$
	$\displaystyle\mathbb{E}_{\nu,\mathrm{DB}(\delta)}[\mathrm{Reg}(T)]=\tilde{O}(R_{2}(T))\quad\text{if }\nu\text{ is arbitrary.}$

Corollary 3.6.

Taking $R_{1}(T)=\sqrt{|\mathcal{Z}|T}$ and $R_{2}(T)=\sqrt{|\mathcal{A}|/|\mathcal{Z}|}\cdot\sqrt{|\mathcal{A}|T}$ , the conclusion of Theorem 3.5 is

	$\displaystyle\mathbb{E}_{\nu,\mathrm{DB}(\delta)}[\mathrm{Reg}(T)]=\tilde{O}(\sqrt{|\mathcal{Z}|T})\quad\text{if }\nu\text{ is conditionally benign,}$
	$\displaystyle\mathbb{E}_{\nu,\mathrm{DB}(\delta)}[\mathrm{Reg}(T)]=\tilde{O}(\sqrt{|\mathcal{A}|/|\mathcal{Z}|}\cdot\sqrt{|\mathcal{A}|T})\quad\text{if }\nu\text{ is arbitrary.}$

Corollary 3.6 indicates that we need to pay an extra factor of $\sqrt{|\mathcal{A}|/|\mathcal{Z}|}$ in the worst-case regret for adaptivity, and it already improves over the worst-case regret of $\operatorname{HAC-UCB}$ . Moreover, our regret analysis does not require their cumbersome assumption that $T\geq 25|\mathcal{A}|^{2}$ . This improvement may be explained as follows. Both dynamic balancing and $\operatorname{HAC-UCB}$ work with two base learners and decide which to pick in each round. However, $\mathrm{DB}$ operates in a more reasonable way: $\mathrm{DB}$ alternates between the two base learners and never deactivates either of them permanently, whereas $\operatorname{HAC-UCB}$ first plays the optimistic base learner persistently up to some point and then switches to $\operatorname{UCB}$ for the remaining rounds. Thus the regret that $\operatorname{HAC-UCB}$ incurs by running the optimistic base learner improperly may be dominant.

3.2 Lower Bounds

In this section we elaborate on the lower bound in Theorem 3.2 in the following Theorem 3.7, which is a generalization of (Bilodeau et al., 2022, Theorem 6.2). The proof of Theorem 3.7 closely follows that of the original, but we are able to derive a continuum of lower bounds that constitute the Pareto regret frontier. For completeness, we provide the full proof in Section D.1.

Theorem 3.7.

There exist constants $c,c^{\prime}>0$ such that, for all MAB algorithms $\mathfrak{a}$ and rate functions $R(T;\mathcal{A},\mathcal{Z})$ , if, for all $\mathcal{A},\mathcal{Z},T$ ,

$\displaystyle\sup_{\nu}\mathbb{E}_{\nu,\mathfrak{a}}[\mathrm{Reg}(T)]\leq R(T;\mathcal{A},\mathcal{Z}),$

then, for all $\mathcal{A},\mathcal{Z}$ and $T$ , either $R(T;\mathcal{A},\mathcal{Z})>c^{\prime}T$ , or there exists a conditionally benign environment $\nu$ such that

$\displaystyle\mathbb{E}_{\nu,\mathfrak{a}}[\mathrm{Reg}(T)]\geq c\cdot\frac{|\mathcal{A}|T}{R(T;\mathcal{A},\mathcal{Z})}.$

Theorem 3.7 shows that any pair of realizable rates must have their product lower bounded by a constant multiple of $|\mathcal{A}|T$ unless the worst-case regret bound is vacuously large. Combining Theorem 3.5 with Theorem 3.7, we have justified the Pareto optimality of dynamic balancing. As a corollary, we have found an adaptation problem in which a model selection method can be optimal, and the price of adaptivity is witnessed by the additional multiplicative factor of $\sqrt{|\mathcal{A}|/|\mathcal{Z}|}$ in the regret bound.

4 Instance-Dependent Bounds via Phased Elimination Algorithm

Besides achieving Pareto optimal regret bounds in Theorem 3.5 that are worst-case in nature, the dynamic balancing algorithm can also enjoy $O(\log T)$ instance-dependent regret at the same time under additional assumptions on the base learners. In particular, $\operatorname{C-UCB}$ may not be our best choice for the benign base learner. To leverage the strength of dynamic balancing, we propose a new causal bandit algorithm that enjoys $\tilde{O}(\sqrt{|\mathcal{Z}|T})$ worst-case regret and a novel logarithmic instance-dependent regret in benign settings in this section. We are the first to pursue instance-dependent results in conditionally benign environments for algorithms that are minimax optimal (up to log factors).

Our new algorithm is built upon the idea of phased elimination with G-optimal design from linear bandits (Lattimore & Szepesvári, 2020; Lattimore et al., 2020). Our regret analysis hinges on a novel reduction from causal bandits to linear bandits. This reduction enables the use of a broad family of linear bandit algorithms in conditionally benign environments, whose regret guarantees remain intact.

Finally, we will discuss the possibilities and challenges regarding adaptive $O(\log T)$ instance-dependent regret.

4.1 Reduction to Linear Bandits

We need additional notation to describe our causal-to-linear reduction. For a benign instance $\nu$ , define the mean reward vector $\mu^{\mathcal{Z}}\in[0,1]^{|\mathcal{Z}|}$ by $\mu^{\mathcal{Z}}(z)=\mathbb{E}_{\nu_{a}}[Y|Z=z],\forall z\in\mathcal{Z}$ , which does not depend on $a$ because the instance is benign. Also, in this section we use $\nu_{a}$ to denote its associated marginal distribution vector $\nu_{a}(Z)\in\mathcal{P}(\mathcal{Z})\subset\mathbb{R}^{|\mathcal{Z}|}$ , and we do not distinguish between an action $a$ and its associated marginal vector $\nu_{a}$ .

Recall that in each round $t$ we play some action $A_{t}$ and then observe context $Z_{t}$ and reward $Y_{t}$ . By simply ignoring the realized contexts $Z_{t}$ , we can write $Y_{t}=\sum_{z\in\mathcal{Z}}\mu^{\mathcal{Z}}(z)\cdot\nu_{A_{t}}(z)+\eta^{\mathcal{A}}_{t}=\langle\mu^{\mathcal{Z}},\nu_{A_{t}}\rangle+\eta^{\mathcal{A}}_{t}$ , where $\eta^{\mathcal{A}}_{t}$ is conditionally 1-sub-Gaussian since $\mathbb{E}[\eta^{\mathcal{A}}_{t}|(A_{s},Y_{s})_{s\leq t-1},A_{t}]=0$ and $\eta^{\mathcal{A}}_{t}\in[-1,1]$ . We may therefore view the game as a linear bandit with actions $\nu_{a}$ and unknown mean reward vector $\mu^{\mathcal{Z}}$ . Consequently, any linear bandit algorithm that allows such conditionally sub-Gaussian noise can operate in our benign setting by ignoring the realized contexts. More importantly, its regret analysis goes through without change, and hence its regret bounds are retained without loss.
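The identity $\mathbb{E}_{\nu_{a}}[Y]=\langle\mu^{\mathcal{Z}},\nu_{a}\rangle$ behind the reduction can be checked numerically on a small benign instance. The sketch below is illustrative only: the vector `mu_z`, the marginals, and the binary reward model are hypothetical choices, and the joint distribution is built so that the reward law given $Z$ is shared by all actions.

```python
def inner(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

# Hypothetical benign instance with two contexts and binary rewards.
mu_z = [0.2, 0.8]                                  # mu^Z(z) = E[Y | Z = z]
marginals = {"a0": [0.5, 0.5], "a1": [0.2, 0.8]}   # nu_a(Z) for each action

def joint(action):
    """Joint law of (Z, Y) under nu_a when P(Y = 1 | Z = z) = mu_z[z]."""
    return {(z, y): marginals[action][z] * (mu_z[z] if y == 1 else 1 - mu_z[z])
            for z in range(len(mu_z)) for y in (0, 1)}

def mean_from_joint(action):
    """mu^A(a) computed directly from the joint distribution."""
    return sum(p * y for (_, y), p in joint(action).items())

# Linear-bandit view: the feature vector of action a is its marginal nu_a,
# and the unknown parameter is mu^Z.
linear_means = {a: inner(mu_z, nu) for a, nu in marginals.items()}
```

Both computations give 0.5 for `a0` and 0.68 for `a1`, so a linear bandit algorithm that only sees the features $\nu_{a}$ and the rewards faces the same mean-reward structure.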

4.2 Phased Elimination and its Regret Bound

Among all valid linear bandit algorithms that can be applied in conditionally benign environments, we opt for the phased elimination algorithm ( $\operatorname{PE}$ ) because of its superior performance when the action set is finite. Its pseudo-code is summarized in Algorithm 2, which is essentially the same as that of Lattimore et al. (2020). However, the regret guarantees we present for $\operatorname{PE}$ are novel. Our first result is an anytime worst-case regret bound, which qualifies $\operatorname{PE}$ as a base learner of dynamic balancing. Again, $\operatorname{PE}=(\operatorname{PE}(\delta))_{\delta\in(0,1)}$ is the family of instances of the phased elimination algorithm, indexed by the confidence level $\delta$ .

Theorem 4.1 (Worst-case regret bound for $\operatorname{PE}$ ).

For all $\delta\in(0,1)$ , the policy $\operatorname{PE}(\delta)$ given by Algorithm 2 satisfies the following regret bound for all conditionally benign environments $\nu$ ,

\displaystyle\mathrm{Reg}(t)\leq C\sqrt{d_{\nu}\log\mathopen{}\left(\frac{|\mathcal{A}|\log T}{\delta}\right)t},\quad\forall t\in[T]

with probability at least $1-\delta$ , where $d_{\nu}=\dim(\mathrm{span}\{\nu_{a}:a\in\mathcal{A}\})$ and $C>0$ is a universal constant. Note that $d_{\nu}\leq|\mathcal{Z}|$ and it could be $|\mathcal{Z}|$ in the worst-case. In particular, after taking $\delta=1/T$ , we obtain the expected regret bound

\displaystyle\mathbb{E}_{\nu,\operatorname{PE}(\delta)}[\mathrm{Reg}(T)]=O\mathopen{}\left(\sqrt{d_{\nu}T\log(|\mathcal{A}|T)}\right).

Corollary 4.2.

$\operatorname{PE}$ is a $d$ -benign family for $d(\delta)=O\mathopen{}\left(\sqrt{|\mathcal{Z}|\log\mathopen{}\left(|\mathcal{A}|\log T/\delta\right)}\right)$ . Therefore, the expected regret bound in Theorem 3.5 can also be achieved by $\mathrm{DB}$ with $\operatorname{PE}$ and $\operatorname{UCB}$ as base learners.

See Appendix B for the proof. Thanks to our reduction, Theorem 4.1 only depends on $d_{\nu}$ (up to log factors) rather than $|\mathcal{Z}|$ . This indicates that the intrinsic complexity of the causal bandit problem is not $|\mathcal{Z}|$ but the possibly smaller quantity $d_{\nu}$ , which is not captured by the $\tilde{O}(\sqrt{|\mathcal{Z}|T})$ regret bound of $\operatorname{C-UCB}$ .
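To see how $d_{\nu}$ can be strictly smaller than $|\mathcal{Z}|$ , the sketch below computes $\dim(\mathrm{span}\{\nu_{a}:a\in\mathcal{A}\})$ by exact Gaussian elimination. The function name `span_dimension` and the toy marginals are assumptions made for illustration: three actions over $|\mathcal{Z}|=4$ contexts whose marginals are all mixtures of two base distributions.

```python
from fractions import Fraction

def span_dimension(vectors):
    """Rank of the given vectors, computed by exact Gaussian elimination."""
    rows = [[Fraction(x) for x in v] for v in vectors]
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        # Look for a row at or below `rank` with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        # Eliminate the column from all rows below the pivot row.
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[rank])]
        rank += 1
    return rank

half, quarter = Fraction(1, 2), Fraction(1, 4)
# Hypothetical marginals: the third is the average of the first two.
marginals = [
    [half, half, 0, 0],
    [0, 0, half, half],
    [quarter, quarter, quarter, quarter],
]
d_nu = span_dimension(marginals)  # 2, although |Z| = 4
```

On such an instance the bound of Theorem 4.1 scales with $d_{\nu}=2$ rather than with $|\mathcal{Z}|=4$ .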

Next we give an instance-dependent regret bound for $\operatorname{PE}$ . Notice that this bound is new even for stochastic linear bandits (with finite action sets). See Appendix B for the proof.

Theorem 4.3 (Instance-dependent regret bound for $\operatorname{PE}$ ).

For all $\delta\in(0,1)$ , the policy $\operatorname{PE}(\delta)$ given by Algorithm 2 satisfies the following regret bound for all conditionally benign environments $\nu$ ,

\displaystyle\mathrm{Reg}(T)\leq C\cdot\frac{d_{\nu}\log\mathopen{}\left(|\mathcal{A}|\log T/\delta\right)}{\Delta_{\min}(\nu)}

with probability at least $1-\delta$ , where $\Delta_{\min}(\nu):=\min_{a\neq a^{*}}\mu^{\mathcal{A}}(a^{*})-\mu^{\mathcal{A}}(a)$ is the minimal sub-optimality gap of instance $\nu$ and $C>0$ is a universal constant. In particular, taking $\delta=1/T$ ,

\displaystyle\mathbb{E}_{\nu,\operatorname{PE}(\delta)}[\mathrm{Reg}(T)]=O\mathopen{}\left(\frac{d_{\nu}\log(|\mathcal{A}|T)}{\Delta_{\min}(\nu)}\right).
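The quantity $\Delta_{\min}(\nu)$ in Theorem 4.3 can be computed directly from $\mu^{\mathcal{Z}}$ and the marginals, since $\mu^{\mathcal{A}}(a)=\langle\mu^{\mathcal{Z}},\nu_{a}\rangle$ in a benign instance. The following sketch does this for a hypothetical three-action instance; the helper names are ours, and the code assumes that the optimal action is unique.

```python
def action_means(mu_z, marginals):
    """Mean reward of every action, mu^A(a) = <mu^Z, nu_a>."""
    return {a: sum(m * p for m, p in zip(mu_z, q)) for a, q in marginals.items()}

def min_gap(mu_z, marginals):
    """Minimal sub-optimality gap, assuming a unique optimal action."""
    means = action_means(mu_z, marginals)
    best = max(means.values())
    gaps = [best - m for m in means.values() if m < best]
    if not gaps:
        raise ValueError("every action is optimal, so no positive gap exists")
    return min(gaps)

# Assumed toy instance.
mu_z = [0.2, 0.5, 0.9]
marginals = {
    "a1": [0.7, 0.2, 0.1],  # mean 0.33
    "a2": [0.1, 0.3, 0.6],  # mean 0.71 (optimal)
    "a3": [0.3, 0.4, 0.3],  # mean 0.53
}
delta_min = min_gap(mu_z, marginals)
```

Here the smallest gap is the one between `a2` and `a3`, and it is this gap that enters the denominator of the instance-dependent bound.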

Algorithm 2 Phased Elimination ( $\operatorname{PE}$ ) in Causal Bandit

Input: Action set $\mathcal{A}$ , marginals $\{\nu_{a}:a\in\mathcal{A}\}$ , $d_{\nu}=\dim(\mathrm{span}\{\nu_{a}:a\in\mathcal{A}\})$ , and confidence level $\delta\in(0,1)$

1. Set $\ell=1$ and let the initial active set $\mathcal{A}_{1}$ be $\mathcal{A}$
2. Find some near-optimal design $\pi_{\ell}\in\mathcal{P}(\mathcal{A}_{\ell})$ with $\max_{a\in\mathcal{A}_{\ell}}\|\nu_{a}\|^{2}_{V(\pi_{\ell})^{-1}}\leq 2d_{\nu}$ and $|\mathrm{supp}(\pi_{\ell})|\leq 4d_{\nu}\log\log(d_{\nu})+16$ , where $V(\pi_{\ell})=\sum_{a\in\mathcal{A}_{\ell}}\pi_{\ell}(a)\nu_{a}\nu_{a}^{\top}$
3. Let $m_{\ell}=2^{\ell-1}(4d_{\nu}\log\log(d_{\nu})+16)$ . Compute $T_{\ell}(a)=\mathopen{}\left\lceil m_{\ell}\pi_{\ell}(a)\right\rceil$ and $T_{\ell}=\sum_{a\in\mathcal{A}_{\ell}}T_{\ell}(a)$
4. Play each action $a\in\mathcal{A}_{\ell}$ exactly $T_{\ell}(a)$ times and we call these $T_{\ell}$ rounds phase $\ell$ . We also observe corresponding context-reward pairs $(Z_{t},Y_{t})_{t\in\text{phase }\ell}$
5. Compute the empirical estimate: $\hat{\mu}^{\mathcal{Z}}_{\ell}=V_{\ell}^{-1}\sum_{t\in\text{phase }\ell}\nu_{A_{t}}Y_{t},$ where $V_{\ell}=\sum_{a\in\mathcal{A}_{\ell}}T_{\ell}(a)\nu_{a}\nu_{a}^{\top}$
6. Eliminate low rewarding actions and update the active set: $\mathcal{A}_{\ell+1}=\Bigl\{a\in\mathcal{A}_{\ell}:\max_{b\in\mathcal{A}_{\ell}}\langle\hat{\mu}^{\mathcal{Z}}_{\ell},\nu_{b}-\nu_{a}\rangle\leq 2\sqrt{\frac{4d_{\nu}}{m_{\ell}}\log\mathopen{}\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}\Bigr\}$
7. $\ell\leftarrow\ell+1$ and Goto 2
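The phase lengths $m_{\ell}$ and the elimination thresholds of Algorithm 2 follow a simple doubling schedule, which the sketch below tabulates. It is only an illustration under stated assumptions: the function name is ours, the design step is not implemented, each phase is charged $\lceil m_{\ell}\rceil$ rounds as a stand-in for the actual length $T_{\ell}\geq m_{\ell}$ , and we require $d_{\nu}\geq 3$ so that $\log\log(d_{\nu})$ is positive.

```python
import math

def phase_schedule(d, n_actions, horizon, delta):
    """List (phase index, m_ell, elimination threshold) until the horizon is used up.

    Assumption of this sketch: each phase is charged ceil(m_ell) rounds,
    which is only a lower bound on the true phase length T_ell.
    """
    if d < 3:
        raise ValueError("this sketch assumes d >= 3 so that log log d > 0")
    base = 4 * d * math.log(math.log(d)) + 16
    log_term = math.log(2 * n_actions * math.log2(horizon) / delta)
    schedule, ell, used = [], 1, 0
    while used < horizon:
        m = 2 ** (ell - 1) * base
        # Threshold from step 6 of Algorithm 2.
        threshold = 2 * math.sqrt(4 * d / m * log_term)
        schedule.append((ell, m, threshold))
        used += math.ceil(m)
        ell += 1
    return schedule

schedule = phase_schedule(d=4, n_actions=10, horizon=1000, delta=0.01)
```

Because $m_{\ell}$ doubles from one phase to the next, the threshold shrinks by a factor of $\sqrt{2}$ per phase, so the number of phases grows only logarithmically in $T$ .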

Remark 4.4.

During the implementation of Algorithm 2, it is possible that $\mathcal{A}_{\ell}$ does not span $\mathbb{R}^{|\mathcal{Z}|}$ for some $\ell$ , so that $V(\pi_{\ell})$ is singular for every $\pi_{\ell}\in\mathcal{P}(\mathcal{A}_{\ell})$ . For example, in later phases $|\mathcal{A}_{\ell}|$ can be smaller than $|\mathcal{Z}|$ . Suppose $\dim(\mathrm{span}\{\nu_{a}:a\in\mathcal{A}_{\ell}\})=r<|\mathcal{Z}|$ . One workaround is to apply an invertible matrix $X\in\mathbb{R}^{|\mathcal{Z}|\times|\mathcal{Z}|}$ to every $\nu_{a}$ with $a\in\mathcal{A}_{\ell}$ such that $X\nu_{a}$ decomposes into an $r$ -dimensional vector $(X\nu_{a})_{[r]}$ followed by a tail of $(|\mathcal{Z}|-r)$ zeros, and $\mathopen{}\left\{(X\nu_{a})_{[r]}:a\in\mathcal{A}_{\ell}\right\}$ spans $\mathbb{R}^{r}$ . We then use $\mathopen{}\left\{(X\nu_{a})_{[r]}:a\in\mathcal{A}_{\ell}\right\}$ as our active set in phase $\ell$ , and the analysis goes through.

4.3 Roadblocks: Instance-Dependent Bounds

Unlike the adaptive worst-case regret studied in Section 3, adaptive instance-dependent regret is less understood and a general theory is still absent in the literature. In particular, we do not know whether $O(\log T)$ regret can always be achieved and, when it is achieved, whether it is tight. We illustrate these issues for model selection methods below. First, it is easy to see that $O(\log T)$ regret can always be achieved in benign environments, e.g., by corralling $\operatorname{PE}$ and $\operatorname{UCB}$ using dynamic balancing, because in this case both base learners admit logarithmic regret. However, the regret bound of $\operatorname{UCB}$ is dominant, and thus a naive calculation only leads to an $O(|\mathcal{A}|\log T/\Delta_{\min})$ regret bound for $\mathrm{DB}$ . It remains open whether we can adapt to the smaller regret achieved by $\operatorname{PE}$ in benign environments. Second, $O(\log T)$ regret is not always guaranteed by model selection in non-benign instances. The only exception we are aware of in the literature is the case where the causal base learner is assumed to incur linear regret whenever its candidate regret bound fails (Cutkosky et al., 2021, Theorem 31). If this type of “algorithmic gap” holds, $\mathrm{DB}$ will only choose the causal base learner in $O(\log T)$ rounds, and hence enjoys logarithmic regret. Moreover, without changing the parameter setting, $\mathrm{DB}$ is able to realize the Pareto optimal rates $(\sqrt{|\mathcal{Z}|T},|\mathcal{A}|\sqrt{T}/\sqrt{|\mathcal{Z}|})$ up to log factors. However, the “algorithmic gap” requirement on the causal base learner is so stringent that we do not know whether it is met by any algorithm in every instance. In Appendix E, we show that a version of $\operatorname{PE}$ incurs linear regret on some instances.

5 Limited Knowledge of the Marginal Distributions over Context Variables

So far, we have assumed that algorithms know the marginal distribution over the post-action context for each arm. Of course, perfect knowledge of these marginals may not hold in practice. What is the effect of only having access to approximate marginals on achievable rates of regret?

In this section, we study this question. We give a lower bound indicating that, with zero access to the marginals, it is impossible for any algorithm to exploit the causal structure and beat the minimax rate of an arbitrary environment. To model this setting, recall that algorithms are defined as mappings taking $(\mathcal{A},\mathcal{Z},T,q)$ to policies. Naturally, algorithms that are agnostic to the marginals should be constant in $q\in\mathcal{P}(\mathcal{Z})^{\mathcal{A}}$ , leading to the following definition:

Definition 5.1.

An algorithm $\mathfrak{a}$ is said to be agnostic to marginals if, for any $\mathcal{A},\mathcal{Z}$ and $T$ , the map

\displaystyle\mathfrak{a}_{\mathcal{A},\mathcal{Z},T}:q\mapsto\mathfrak{a}(\mathcal{A},\mathcal{Z},T,q)

is constant over $\mathcal{P}(\mathcal{Z})^{\mathcal{A}}$ . We denote the set of all such algorithms by $\mathbb{A}_{\mathrm{agnostic}}$ .

Examples of algorithms from $\mathbb{A}_{\mathrm{agnostic}}$ include not only heuristic non-causal algorithms like $\operatorname{UCB}$ , but also versions of causal algorithms that are always given the same marginals as input. For any algorithm $\mathfrak{a}\in\mathbb{A}_{\mathrm{agnostic}}$ , we write the policy it induces given $\mathcal{A},\mathcal{Z},T$ as $\mathfrak{a}(\mathcal{A},\mathcal{Z},T,\cdot)$ to highlight its independence of the $q$ component. Our lower bound shows that, under this zero-marginal-knowledge regime, we cannot do better than the optimal non-causal algorithm.
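The second kind of example can be made concrete with a small sketch: wrapping any algorithm so that it is always run with one fixed set of marginals yields a map that is constant in $q$ . Everything below is hypothetical and chosen for illustration only, including the toy "causal" algorithm, which simply commits to the action whose marginal puts the most mass on the first context.

```python
def toy_causal_algorithm(actions, contexts, horizon, q):
    """Hypothetical algorithm whose induced policy depends on the marginals q."""
    best = max(actions, key=lambda a: q[a][0])
    return lambda history: best  # a trivial policy that always plays `best`

def make_agnostic(algorithm, fixed_q):
    """Wrap `algorithm` so that the marginals passed in are ignored."""
    def agnostic(actions, contexts, horizon, q):
        # q is deliberately unused: the same fixed_q is always supplied.
        return algorithm(actions, contexts, horizon, fixed_q)
    return agnostic

actions, contexts, horizon = ["a1", "a2"], [0, 1], 100
q1 = {"a1": [0.9, 0.1], "a2": [0.2, 0.8]}
q2 = {"a1": [0.3, 0.7], "a2": [0.6, 0.4]}

agnostic = make_agnostic(toy_causal_algorithm, fixed_q=q1)
# The wrapped algorithm behaves identically for q1 and q2 ...
same = agnostic(actions, contexts, horizon, q1)([]) == agnostic(actions, contexts, horizon, q2)([])
# ... whereas the original algorithm does not.
differs = (toy_causal_algorithm(actions, contexts, horizon, q1)([])
           != toy_causal_algorithm(actions, contexts, horizon, q2)([]))
```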

Theorem 5.2.

For all $\mathcal{A},\mathcal{Z},T\geq|\mathcal{A}|$ and MAB algorithms $\mathfrak{a}\in\mathbb{A}_{\mathrm{agnostic}}$ , there exists a conditionally benign environment $\nu\in\mathcal{P}(\mathcal{Z}\times\mathcal{Y})^{\mathcal{A}}$ such that

\displaystyle\mathbb{E}_{\nu,\mathfrak{a}}[\mathrm{Reg}(T)]\geq c\sqrt{|\mathcal{A}|T},

where $c>0$ is a universal constant.

See Section D.2 for the proof.

Remark 5.3.

Our lower bound improves on Lu et al. (2020, Theorem 4), which is of the form of $C_{\varepsilon}\sqrt{|\mathcal{A}|}T^{1/2-\varepsilon},\forall\varepsilon>0$ and only holds for some set of non-causal algorithms, which is a strict subset of $\mathbb{A}_{\mathrm{agnostic}}$ .

5.1 Phased Elimination with Approximate Marginals

Despite the negative result of Theorem 5.2, we now argue that some level of misspecification is allowed in the prior knowledge of marginals. Upon interacting with environment $\nu$ , suppose we are given some marginal $\tilde{\nu}(Z)\in\mathcal{P}(\mathcal{Z})^{\mathcal{A}}$ which may deviate from the true $\nu(Z)$ to some extent. We show that even instantiating $\operatorname{PE}$ with the possibly inaccurate $\tilde{\nu}(Z)$ may yield $\tilde{O}(\sqrt{T})$ regret, following a similar result for $\operatorname{C-UCB}$ by Bilodeau et al. (2022). First we need the following definition to measure the deviation of $\tilde{\nu}(Z)$ from $\nu(Z)$ .

Definition 5.4.

(Bilodeau et al., 2022, Definition 4.2) For any $\varepsilon\geq 0$ , $\tilde{\nu}(Z)$ and $\nu(Z)$ are said to be $\varepsilon$ -close if

\displaystyle\sup_{a\in\mathcal{A}}\sum_{z\in\mathcal{Z}}\mathopen{}\left|\tilde{\nu}_{a}(z)-\nu_{a}(z)\right|\leq\varepsilon.
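Definition 5.4 is straightforward to evaluate when both sets of marginals are available as vectors. The following sketch computes the left-hand side of the display above and tests $\varepsilon$ -closeness; the function names and the toy marginals are assumptions introduced for illustration.

```python
def marginal_deviation(nu_tilde, nu):
    """Largest L1 distance between approximate and true marginals over actions."""
    return max(
        sum(abs(p - q) for p, q in zip(nu_tilde[a], nu[a]))
        for a in nu
    )

def is_eps_close(nu_tilde, nu, eps):
    """True if nu_tilde and nu are eps-close in the sense of Definition 5.4."""
    return marginal_deviation(nu_tilde, nu) <= eps

# Hypothetical true and approximate marginals over |Z| = 3 contexts.
nu = {"a1": [0.7, 0.2, 0.1], "a2": [0.1, 0.3, 0.6]}
nu_tilde = {"a1": [0.6, 0.3, 0.1], "a2": [0.1, 0.3, 0.6]}

deviation = marginal_deviation(nu_tilde, nu)  # about 0.2, attained at a1
```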

By our reduction in Section 4.1, causal bandits with misspecified marginals reduce to the well-studied misspecified linear bandits, which yields the following regret bound that subsumes Theorem 4.1. The proof is largely based on the analysis of phased elimination in Lattimore et al. (2020, Proposition 5.1), with the modifications needed to handle conditionally sub-Gaussian noise and to provide an anytime regret bound. See Appendix B for details.

Theorem 5.5 (Worst-case regret bound, with approximate marginal distributions).

In any conditionally benign environment $\nu$ , suppose we instantiate $\operatorname{PE}(\delta)$ with $\tilde{\nu}(Z)$ . If $\tilde{\nu}(Z)$ and $\nu(Z)$ are $\varepsilon$ -close, then with probability at least $1-\delta$ , the regret of $\operatorname{PE}(\delta)$ is bounded for all rounds $t\in[T]$ by

\displaystyle\mathrm{Reg}(t)\leq C\mathopen{}\left(\sqrt{d_{\tilde{\nu}}\log\mathopen{}\left(\frac{|\mathcal{A}|\log T}{\delta}\right)t}+\varepsilon t\sqrt{d_{\tilde{\nu}}}\log T\right),

where $C>0$ is a universal constant and $d_{\tilde{\nu}}=\dim(\mathrm{span}\{\tilde{\nu}_{a}:a\in\mathcal{A}\})$ .

It follows that $\varepsilon=\tilde{O}(\sqrt{1/T})$ suffices to recover all aforementioned regret guarantees of phased elimination and dynamic balancing. On the other hand, such a requirement on $\varepsilon$ is almost necessary to avoid the lower bound in Theorem 5.2: the proof of Theorem 5.2 shows that when $\varepsilon=\Omega(\sqrt{|\mathcal{A}|/T})$ , for any algorithm there exist a conditionally benign environment $\nu$ and an approximate marginal $\tilde{\nu}(Z)$ such that $\tilde{\nu}(Z)$ and $\nu(Z)$ are $\varepsilon$ -close, but the algorithm incurs $\Omega(\sqrt{|\mathcal{A}|T})$ regret on $\nu$ when it is given $\tilde{\nu}(Z)$ as input.
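To make the sufficiency claim concrete, here is the arithmetic under the illustrative assumption $\varepsilon\leq c/\sqrt{T}$ for some constant $c>0$ (the constant $c$ is notation introduced here, not taken from the theorem). For every $t\in[T]$ , the misspecification term in Theorem 5.5 satisfies

\displaystyle\varepsilon t\sqrt{d_{\tilde{\nu}}}\log T\leq c\,\frac{t}{\sqrt{T}}\sqrt{d_{\tilde{\nu}}}\log T\leq c\sqrt{d_{\tilde{\nu}}t}\log T,

where the last step uses $t/\sqrt{T}\leq\sqrt{t}$ for $t\leq T$ . The misspecification term is then of the same order, up to log factors, as the leading $\sqrt{d_{\tilde{\nu}}t}$ term.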

It is worth mentioning that the $\sqrt{d_{\tilde{\nu}}}$ factor in the misspecification term cannot be improved in many regimes for linear bandit algorithms (Lattimore et al., 2020). However, $\operatorname{C-UCB}$ is able to shave this factor off (Bilodeau et al., 2022, Theorem 4.3) by utilizing realized contexts rather than the least-squares estimate of the mean reward vector $\mu^{\mathcal{Z}}$ . From this perspective, there is a price to pay for pursuing better instance-dependent results by ignoring the context information.

6 Conclusions and Discussions

We provide a comprehensive characterization of the Pareto regret frontier for the bandit problem in the context of adapting to causal structure whenever feasible. We also give the first instance-dependent regret bound under conditionally benign environments, based on our novel causal-to-linear reduction. Finally, we show that the common assumption that we have access to the true marginals is necessary in general but still can be relaxed in some cases.

For future work, it would be important to design algorithms that are easier to implement than running dynamic balancing over base learners. On the theoretical side, it would be interesting to investigate other causal bandit scenarios involving adaptivity in light of our Pareto regret frontier. For example, we may define a series of “semi-benign” settings interpolating between conditionally benign environments and non-benign environments and study the Pareto regret frontier thereof.

Acknowledgements

ZL is supported by the Vector Research Grant at the Vector Institute. IA is supported by the Vatat Scholarship from the Israeli Council for Higher Education. DMR is supported by an NSERC Discovery Grant and funding through his Canada CIFAR AI Chair at the Vector Institute. The authors would like to thank Tomer Koren, Blair Bilodeau and Csaba Szepesvári for helpful discussions at different stages of this work.

References

Agarwal et al. (2017) Agarwal, A., Luo, H., Neyshabur, B., and Schapire, R. E. Corralling a band of bandit algorithms. In Conference on Learning Theory. PMLR, 2017.
Arora et al. (2021) Arora, R., Marinov, T. V., and Mohri, M. Corralling stochastic bandit algorithms. In International Conference on Artificial Intelligence and Statistics. PMLR, 2021.
Bareinboim et al. (2015) Bareinboim, E., Forney, A., and Pearl, J. Bandits with unobserved confounders: A causal approach. Advances in Neural Information Processing Systems, 28, 2015.
Bilodeau et al. (2022) Bilodeau, B., Wang, L., and Roy, D. Adaptively exploiting d-separators with causal bandits. Advances in Neural Information Processing Systems, 35, 2022.
Cutkosky et al. (2020) Cutkosky, A., Das, A., and Purohit, M. Upper confidence bounds for combining stochastic bandits. arXiv preprint arXiv:2012.13115, 2020.
Cutkosky et al. (2021) Cutkosky, A., Dann, C., Das, A., Gentile, C., Pacchiano, A., and Purohit, M. Dynamic balancing for model selection in bandits and RL. In International Conference on Machine Learning. PMLR, 2021.
Koolen (2013) Koolen, W. M. The Pareto regret frontier. Advances in Neural Information Processing Systems, 26, 2013.
Lattimore et al. (2016) Lattimore, F., Lattimore, T., and Reid, M. D. Causal bandits: Learning good interventions via causal inference. Advances in Neural Information Processing Systems, 29, 2016.
Lattimore (2015) Lattimore, T. The Pareto regret frontier for bandits. Advances in Neural Information Processing Systems, 28, 2015.
Lattimore & Szepesvári (2020) Lattimore, T. and Szepesvári, C. Bandit algorithms. Cambridge University Press, 2020.
Lattimore et al. (2020) Lattimore, T., Szepesvári, C., and Weisz, G. Learning with good feature representations in bandits and in RL with a generative model. In International Conference on Machine Learning. PMLR, 2020.
Lu et al. (2020) Lu, Y., Meisami, A., Tewari, A., and Yan, W. Regret analysis of bandit problems with causal background knowledge. In Conference on Uncertainty in Artificial Intelligence. PMLR, 2020.
Lu et al. (2021) Lu, Y., Meisami, A., and Tewari, A. Causal bandits with unknown graph structure. Advances in Neural Information Processing Systems, 34, 2021.
Malek et al. (2023) Malek, A., Aglietti, V., and Chiappa, S. Additive causal bandits with unknown graph. arXiv preprint arXiv:2306.07858, 2023.
Marinov & Zimmert (2021) Marinov, T. V. and Zimmert, J. The Pareto frontier of model selection for general contextual bandits. Advances in Neural Information Processing Systems, 34, 2021.
Nair et al. (2021) Nair, V., Patil, V., and Sinha, G. Budgeted and non-budgeted causal bandits. In International Conference on Artificial Intelligence and Statistics. PMLR, 2021.
Pacchiano et al. (2020a) Pacchiano, A., Dann, C., Gentile, C., and Bartlett, P. Regret bound balancing and elimination for model selection in bandits and RL. arXiv preprint arXiv:2012.13045, 2020a.
Pacchiano et al. (2020b) Pacchiano, A., Phan, M., Abbasi Yadkori, Y., Rao, A., Zimmert, J., Lattimore, T., and Szepesvári, C. Model selection in contextual stochastic bandit problems. Advances in Neural Information Processing Systems, 33, 2020b.
Sen et al. (2017) Sen, R., Shanmugam, K., Dimakis, A. G., and Shakkottai, S. Identifying best interventions through online importance sampling. In International Conference on Machine Learning. PMLR, 2017.
Xiong & Chen (2022) Xiong, N. and Chen, W. Pure exploration of causal bandits. arXiv preprint arXiv:2206.07883, 2022.
Zhu & Nowak (2022) Zhu, Y. and Nowak, R. Pareto optimal model selection in linear bandits. In International Conference on Artificial Intelligence and Statistics. PMLR, 2022.

Appendix A Regret Analysis for Dynamic Balancing

In this section we show that the regret guarantees of dynamic balancing in Cutkosky et al. (2021) can be generalized to our problem and provide a proof of our main upper bound Theorem 3.5.

Notations.

For base learner $i$ , we use $\mathrm{CandidReg}_{i}(t)$ to denote its candidate anytime regret bound that is expected to hold in its favorable settings. Throughout we consider $\mathrm{CandidReg}_{i}(t)$ with the form of $d_{i}\sqrt{t}$ , where $d_{i}$ implicitly depends on the confidence parameter $\delta$ . Let $i_{t}$ be the index of the base learner selected in round $t$ . $U_{i}(t)=\sum_{s=1}^{t}Y_{s}\mathbb{I}\mathopen{}\left\{i=i_{s}\right\}$ is the observed cumulative reward in the first $t$ rounds where $i$ is picked, and $n_{i}(t)=\sum_{s=1}^{t}\mathbb{I}\mathopen{}\left\{i=i_{s}\right\}$ is the number of rounds $i$ is picked by the end of round $t$ . The local regret of $i$ up to round $t$ is $\mathrm{Reg}_{i}(t)=n_{i}(t)\mu^{\mathcal{A}}(a^{*})-U_{i}(t)$ . We say learner $i$ is well-specified if $\mathrm{Reg}_{i}(t)\leq\mathrm{CandidReg}_{i}(n_{i}(t))=d_{i}\sqrt{n_{i}(t)},% \forall t\in[T]$ and otherwise it is misspecified. We use $i_{\star}$ to denote any well-specified learner.

A.1 Preliminaries

Roughly speaking, in each round $t$ , dynamic balancing works by (1) running a misspecification test to temporarily de-activate misspecified base learners and (2) picking the learner $i_{t}$ with minimal putative regret $d_{i}\sqrt{n_{i}(t)}$ among all active learners $i$ in this round. In this way, the regret incurred by $\mathrm{DB}$ is comparable to that of the best well-specified learner.
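The selection rule in step (2) can be sketched in a few lines. This is a simplified illustration rather than the full procedure of Cutkosky et al. (2021): the misspecification test of step (1) is not implemented (the set of active learners is simply passed in), and the function names are ours.

```python
import math

def select_learner(d, n, active):
    """Pick the active learner with the smallest putative regret d_i * sqrt(n_i)."""
    return min(active, key=lambda i: d[i] * math.sqrt(n[i]))

def simulate_balancing(d, rounds):
    """Repeatedly apply the selection rule, keeping every learner active.

    Assumption of this sketch: no learner is ever de-activated.
    """
    n = [0] * len(d)
    active = list(range(len(d)))
    for _ in range(rounds):
        n[select_learner(d, n, active)] += 1
    return n

# Two hypothetical base learners with putative coefficients d_1 = 1 and d_2 = 2.
counts = simulate_balancing([1.0, 2.0], rounds=500)
```

The rule keeps the putative regrets $d_{i}\sqrt{n_{i}(t)}$ roughly equal across learners, so in this toy run the first learner is played about four times as often as the second.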

Notice that dynamic balancing was originally developed for stochastic contextual bandits (where contexts are revealed prior to actions) in Cutkosky et al. (2021). To see that $\mathrm{DB}$ can also be applied in stochastic bandits with post-action contexts, it is worth identifying several important features of $\mathrm{DB}$ :

1. First of all, the meta decision by $\mathrm{DB}$ on each round $t$ only depends on the global information, i.e. $U_{i}(t)$ and $n_{i}(t)$ (as well as user-specified $d_{i},b_{i}$ and $v_{i}$ ). In particular, it does not need any information regarding context variables or internal states of base learners.
2. Second, $\mathrm{DB}$ only updates the selected base learner $i_{t}$ in each round $t$ , and the update only uses the reward and contextual information observed in this round, where the context can be either pre-action or post-action, or both. Thus the regret guarantees of $\mathrm{DB}$ would hold regardless of the nature of contexts given that the internal updates of base learners are not affected.

Therefore, by the observations above, dynamic balancing does not rely on what kind of (stochastic) contextual information can be observed in the underlying (stochastic) bandit problem.

Now we state the worst-case regret bound of $\mathrm{DB}$ in Cutkosky et al. (2021) adapted to our setting. First define the good event

\displaystyle\mathcal{E}(\delta)=\mathopen{}\left\{\forall i\in\{1,2\},\forall t\in[T]:|n_{i}(t)\mu^{\mathcal{A}}(a^{*})-U_{i}(t)-\mathrm{Reg}_{i}(t)|\leq c\sqrt{n_{i}(t)\log\mathopen{}\left(\frac{2\log n_{i}(t)}{\delta}\right)}\right\}

on which we are able to control the regret of $\mathrm{DB}$ . According to the analysis of Cutkosky et al. (2021, Lemma 5), we can fix $c$ to be some absolute constant (which can be actually set to $3$ in our setting) such that $\mathbb{P}_{\nu,\pi}[\mathcal{E}(\delta)]\geq 1-\delta$ for any $\nu\in\mathcal{P}(\mathcal{Z}\times\mathcal{Y})^{\mathcal{A}}$ and $\pi\in\Pi(\mathcal{A},\mathcal{Z},T)$ . Conditioning on $\mathcal{E}(\delta)$ , we have the following regret bound:

Proposition A.1 (Adapted version of Theorem 22 in Cutkosky et al. (2021)).

Let $\mathfrak{a}_{1}$ be a $d_{1}$ -benign family and let $\mathfrak{a}_{2}$ be a $d_{2}$ -arbitrary family of learners. Let $Z_{1},Z_{2}$ be arbitrary positive real numbers. For all $\delta\in(0,1)$ , we can set hyper-parameters

\displaystyle b_{i}(t)=\max\Bigl\{\frac{2Z_{i}}{\sqrt{t}},\frac{3\sqrt{2\log(2\log t/\delta)}}{\sqrt{t}}\Bigr\},\quad v_{i}=\sqrt{\frac{Z_{i}}{d_{i}(\delta)^{3}}}

in dynamic balancing such that the policy $\mathrm{DB}(\delta)$ given by dynamic balancing with $\mathfrak{a}_{1},\mathfrak{a}_{2},d_{1},d_{2}$ satisfies the following: for all instances $\nu$ , conditioning on $\mathcal{E}(\delta)$ and the existence of a well-specified base learner $i_{\star}$ , the regret of $\mathrm{DB}(\delta)$ is bounded by

\displaystyle\mathrm{Reg}(T)\leq\mathrm{Reg}_{i_{\star}}(T)+C^{\prime}\mathopen{}\left(\sqrt{\log\mathopen{}\left(\frac{\log T}{\delta}\right)}+Z_{i_{\star}}d_{i_{\star}}(\delta)+\sum_{i\neq i_{\star}}\frac{d_{i}(\delta)}{Z_{i}}\right)\sqrt{T},

where $C^{\prime}>0$ is a universal constant.

It is straightforward to see that Proposition A.1 is obtained by taking $M=2,C=1,c=3,\beta=1/2$ , and $W_{1}=W_{2}=\sqrt{2}$ in Cutkosky et al. (2021, Theorem 22).
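For reference, the hyper-parameters $b_{i}(t)$ and $v_{i}$ of Proposition A.1 can be evaluated numerically as in the sketch below. The function name and argument names are assumptions for illustration, and the sketch restricts to $t\geq 2$ so that $\log(2\log t/\delta)$ is well defined and positive.

```python
import math

def db_hyperparameters(Z_i, d_i, t, delta):
    """Return (b_i(t), v_i) as displayed in Proposition A.1.

    Assumption of this sketch: t >= 2 and 0 < delta < 1, so that the
    logarithm inside the square root is positive.
    """
    if t < 2 or not 0 < delta < 1:
        raise ValueError("this sketch assumes t >= 2 and 0 < delta < 1")
    confidence = 3 * math.sqrt(2 * math.log(2 * math.log(t) / delta))
    b = max(2 * Z_i, confidence) / math.sqrt(t)
    v = math.sqrt(Z_i / d_i ** 3)
    return b, v

# Hypothetical values: a large Z_i makes the first argument of the max dominate.
b_large, v_large = db_hyperparameters(Z_i=100.0, d_i=1.0, t=100, delta=0.01)
b_unit, v_unit = db_hyperparameters(Z_i=1.0, d_i=1.0, t=100, delta=0.01)
```

Both arguments of the maximum decay like $1/\sqrt{t}$ , so the choice of $Z_{i}$ only decides which of the two constants is active.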

A.2 Proof of Theorem 3.5

Theorem 3.5 is the immediate consequence of the following regret bound, which is derived by instantiating $Z_{1},Z_{2}$ in Proposition A.1 with specific values.

Proposition A.2.

For every pair of reasonable rate functions $R_{1}(T),R_{2}(T)$ such that $R_{1}(T)R_{2}(T)\geq|\mathcal{A}|T$ , we can instantiate Proposition A.1 with $Z_{1}=1,Z_{2}=\frac{R_{2}(T)}{\sqrt{|\mathcal{A}|T}}$ such that for all $\delta\in(0,1)$ , the policy $\mathrm{DB}(\delta)$ with the same setup as Proposition A.1 satisfies the following: for all instances $\nu$ , with probability at least $1-O(\delta)$ , the regret of $\mathrm{DB}(\delta)$ is bounded by

\displaystyle\begin{aligned}
\mathrm{Reg}(T)&\leq C^{\prime}\mathopen{}\left(d_{1}(\delta)+\sqrt{\log\mathopen{}\left(\frac{\log T}{\delta}\right)}+\frac{d_{2}(\delta)}{\sqrt{|\mathcal{A}|T}}R_{1}(T)\right)\sqrt{T},\quad\text{if $\nu$ is conditionally benign;}\\
\mathrm{Reg}(T)&\leq C^{\prime}\mathopen{}\left(d_{1}(\delta)+\sqrt{\log\mathopen{}\left(\frac{\log T}{\delta}\right)}+\frac{d_{2}(\delta)}{\sqrt{|\mathcal{A}|T}}R_{2}(T)\right)\sqrt{T},\quad\text{if $\nu$ is arbitrary,}
\end{aligned}

where $C^{\prime}>0$ is a universal constant.

Now we can see that our main upper bound Theorem 3.5 is proved immediately after taking $d_{1}(\delta)=O\mathopen{}\left((\sqrt{|\mathcal{Z}|}+\sqrt{\log(T/\delta)})\sqrt{\log(|\mathcal{Z}|T/\delta)}\right)$ , $d_{2}(\delta)=O(\sqrt{|\mathcal{A}|\log(|\mathcal{A}|T/\delta)})$ and $\delta=1/T$ .

Proof of Proposition A.2.

By Definition 3.3, we know that for all conditionally benign instances $\nu$ , with probability at least $1-O(\delta)$ , learner $\mathfrak{a}_{1}$ is well-specified with $\mathrm{CandidReg}_{1}(t)=d_{1}(\delta)\sqrt{t}$ and the regret bound in Proposition A.1 holds with $i_{\star}=1$ . Plugging in $Z_{1}=1,Z_{2}=\frac{R_{2}(T)}{\sqrt{|\mathcal{A}|T}}$ , the regret of $\mathrm{DB}(\delta)$ is bounded by

\displaystyle\mathrm{Reg}(T)\leq C^{\prime}\mathopen{}\left(2d_{1}(\delta)+\sqrt{\log\mathopen{}\left(\frac{\log T}{\delta}\right)}+d_{2}(\delta)\frac{\sqrt{|\mathcal{A}|T}}{R_{2}(T)}\right)\sqrt{T}.

Similarly for all instances $\nu$ , with probability at least $1-O(\delta)$ , learner $\mathfrak{a}_{2}$ is well-specified with $\mathrm{CandidReg}_{2}(t)=d_{2}(\delta)\sqrt{t}$ and the regret bound in Proposition A.1 holds with $i_{\star}=2$ , which is

\displaystyle\mathrm{Reg}(T)\leq C^{\prime}\mathopen{}\left(d_{2}(\delta)+\sqrt{\log\mathopen{}\left(\frac{\log T}{\delta}\right)}+d_{2}(\delta)\frac{R_{2}(T)}{\sqrt{|\mathcal{A}|T}}+d_{1}(\delta)\right)\sqrt{T}.

By our assumption that $(R_{1}(T),R_{2}(T))$ is reasonable and $R_{1}(T)R_{2}(T)\geq|\mathcal{A}|T$ , we have that $R_{2}(T)\geq\sqrt{|\mathcal{A}|T}$ and $\frac{|\mathcal{A}|T}{R_{2}(T)}\leq R_{1}(T)$ . Hence the regret of $\mathrm{DB}(\delta)$ for all instances $\nu$ is further bounded by

\displaystyle\begin{aligned}
\mathrm{Reg}(T)&\leq C^{\prime}\mathopen{}\left(d_{1}(\delta)+\sqrt{\log\mathopen{}\left(\frac{\log T}{\delta}\right)}+\frac{d_{2}(\delta)}{\sqrt{|\mathcal{A}|T}}R_{1}(T)\right)\sqrt{T},\quad\text{if $\nu$ is conditionally benign;}\\
\mathrm{Reg}(T)&\leq C^{\prime}\mathopen{}\left(d_{1}(\delta)+\sqrt{\log\mathopen{}\left(\frac{\log T}{\delta}\right)}+\frac{d_{2}(\delta)}{\sqrt{|\mathcal{A}|T}}R_{2}(T)\right)\sqrt{T},\quad\text{if $\nu$ is arbitrary,}
\end{aligned}

which completes the proof. ∎
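For completeness, the two elementary inequalities behind the last step of this proof can be spelled out as follows, using only $\frac{|\mathcal{A}|T}{R_{2}(T)}\leq R_{1}(T)$ and $R_{2}(T)\geq\sqrt{|\mathcal{A}|T}$ :

\displaystyle d_{2}(\delta)\frac{\sqrt{|\mathcal{A}|T}}{R_{2}(T)}=\frac{d_{2}(\delta)}{\sqrt{|\mathcal{A}|T}}\cdot\frac{|\mathcal{A}|T}{R_{2}(T)}\leq\frac{d_{2}(\delta)}{\sqrt{|\mathcal{A}|T}}R_{1}(T),\qquad d_{2}(\delta)\leq d_{2}(\delta)\frac{R_{2}(T)}{\sqrt{|\mathcal{A}|T}}.

The first inequality gives the bound in the conditionally benign case, and the second lets the stand-alone $d_{2}(\delta)$ term be absorbed in the arbitrary case; the resulting constant factors are absorbed into $C^{\prime}$ .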

Appendix B Regret analysis of phased elimination

In this section we prove Theorem 4.3 and Theorem 5.5; Theorem 4.1 follows by taking $\varepsilon=0$ in Theorem 5.5. Recall that the proof of Theorem 5.5 is based on the analysis of phased elimination in Lattimore et al. (2020, Proposition 5.1). For simplicity we use $\mathbb{P}$ and $\mathbb{E}$ to denote the probability and expectation operators determined jointly by the underlying conditionally benign environment $\nu$ and the phased elimination algorithm. We also use $\Delta_{a}$ and $\Delta_{\min}$ to denote the true sub-optimality gap $\Delta_{a}(\nu)=\mu^{\mathcal{A}}(a^{*})-\mu^{\mathcal{A}}(a)$ and the minimal sub-optimality gap $\Delta_{\min}(\nu)=\min_{a\neq a^{*}}\mu^{\mathcal{A}}(a^{*})-\mu^{\mathcal{A}}(a)$ , respectively, with regard to the underlying instance $\nu$ .

B.1 Prerequisite

Lemma B.1.

(In-phase concentration) For any phase $\ell$ , let

\displaystyle E^{\mathrm{phase}}_{\ell}(\delta)=\mathopen{}\left\{|\langle\hat{\mu}^{\mathcal{Z}}_{\ell}-\mu^{\mathcal{Z}},\tilde{\nu}_{a}\rangle|\leq 2\varepsilon\sqrt{d_{\tilde{\nu}}}+\sqrt{\frac{4d_{\tilde{\nu}}}{m_{\ell}}\log\mathopen{}\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)},\forall a\in\mathcal{A}_{\ell}\right\}

and $\mathscr{F}_{\ell}$ be the $\sigma$ -algebra generated by the history up to the start of phase $\ell$ . Then $\mathbb{P}[E^{\mathrm{phase}}_{\ell}(\delta)|\mathscr{F}_{\ell}]\geq 1-\frac{\delta}{\log_{2}(T)}$ .

Proof of Lemma B.1.

Let $b_{a}=\langle\nu_{a}-\tilde{\nu}_{a},\mu^{\mathcal{Z}}\rangle,\forall a\in\mathcal{A}$ be the error term due to the use of inaccurate marginals. Then $|b_{a}|\leq\varepsilon,\forall a\in\mathcal{A}$ , since $\mu^{\mathcal{Z}}\in[0,1]^{|\mathcal{Z}|}$ and $\nu(Z)$ and $\tilde{\nu}(Z)$ are $\varepsilon$ -close. Observe that

\displaystyle\begin{aligned}
\langle\hat{\mu}^{\mathcal{Z}}_{\ell}-\mu^{\mathcal{Z}},\tilde{\nu}_{a}\rangle&=\langle V_{\ell}^{-1}\sum_{t\in\text{phase }\ell}\tilde{\nu}_{A_{t}}\tilde{\nu}_{A_{t}}^{\top}\mu^{\mathcal{Z}},\tilde{\nu}_{a}\rangle-\langle\mu^{\mathcal{Z}},\tilde{\nu}_{a}\rangle\\
&\quad+\langle V_{\ell}^{-1}\sum_{t\in\text{phase }\ell}\tilde{\nu}_{A_{t}}\eta^{\mathcal{A}}_{t},\tilde{\nu}_{a}\rangle+\langle V_{\ell}^{-1}\sum_{t\in\text{phase }\ell}\tilde{\nu}_{A_{t}}b_{A_{t}},\tilde{\nu}_{a}\rangle\\
&=\sum_{t\in\text{phase }\ell}\langle V_{\ell}^{-1}\tilde{\nu}_{A_{t}},\tilde{\nu}_{a}\rangle\eta^{\mathcal{A}}_{t}+\sum_{t\in\text{phase }\ell}\langle V_{\ell}^{-1}\tilde{\nu}_{A_{t}},\tilde{\nu}_{a}\rangle b_{A_{t}}.
\end{aligned}

Using the Cauchy-Schwarz inequality and the fact that for all $a\in\mathcal{A}_{\ell},\|\tilde{\nu}_{a}\|_{V_{\ell}^{-1}}^{2}\leq\frac{1}{m_{\ell}}\|\tilde{\nu}_{a}\|_{V(\pi_{\ell})^{-1}}^{2}\leq\frac{2d_{\tilde{\nu}}}{m_{\ell}}$ , the second term on the RHS of the above equality can be bounded by

\displaystyle\begin{aligned}
\mathopen{}\left|\sum_{t\in\text{phase }\ell}\langle V_{\ell}^{-1}\tilde{\nu}_{A_{t}},\tilde{\nu}_{a}\rangle b_{A_{t}}\right|&\leq\varepsilon\sum_{t\in\text{phase }\ell}\mathopen{}\left|\langle V_{\ell}^{-1}\tilde{\nu}_{A_{t}},\tilde{\nu}_{a}\rangle\right|\\
&\leq\varepsilon\sqrt{\mathopen{}\left(\sum_{t\in\text{phase }\ell}1\right)\mathopen{}\left(\sum_{t\in\text{phase }\ell}\langle V_{\ell}^{-1}\tilde{\nu}_{A_{t}},\tilde{\nu}_{a}\rangle^{2}\right)}\\
&=\varepsilon\sqrt{T_{\ell}\|\tilde{\nu}_{a}\|_{V_{\ell}^{-1}}^{2}}\leq\varepsilon\sqrt{2m_{\ell}\frac{2d_{\tilde{\nu}}}{m_{\ell}}}=2\varepsilon\sqrt{d_{\tilde{\nu}}}.
\end{aligned}

To bound the first term, notice that $(A_{t})_{t\in\text{phase }\ell}$ and $V_{\ell}$ are fixed given the history prior to the start of phase $\ell$ . Hence $(\eta^{\mathcal{A}}_{t})_{t\in\text{phase }\ell}$ are independent conditioned on $\mathscr{F}_{\ell}$ and take values in $[-1,1]$ . By standard concentration bounds (e.g., Hoeffding's inequality), we have that with probability at least $1-\frac{\delta}{|\mathcal{A}|\log_{2}(T)}$ ,

\displaystyle\mathopen{}\left|\sum_{t\in\text{phase }\ell}\langle V_{\ell}^{-1}\tilde{\nu}_{A_{t}},\tilde{\nu}_{a}\rangle\eta^{\mathcal{A}}_{t}\right|\leq\sqrt{2\sum_{t\in\text{phase }\ell}\langle V_{\ell}^{-1}\tilde{\nu}_{A_{t}},\tilde{\nu}_{a}\rangle^{2}\log\mathopen{}\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)},

where the RHS can be rewritten as

\displaystyle\sqrt{2\|\tilde{\nu}_{a}\|_{V_{\ell}^{-1}}^{2}\log\mathopen{}\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}\leq\sqrt{\frac{4d_{\tilde{\nu}}}{m_{\ell}}\log\mathopen{}\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}.

Combining the two upper bounds above and taking a union bound over all $a\in\mathcal{A}_{\ell}$ , we have that with probability at least $1-\frac{\delta}{\log_{2}(T)}$ ,

\displaystyle|\langle\hat{\mu}^{\mathcal{Z}}_{\ell}-\mu^{\mathcal{Z}},\tilde{% \nu}_{a}\rangle|\leq 2\varepsilon\sqrt{d_{\tilde{\nu}}}+\sqrt{\frac{4d_{\tilde% {\nu}}}{m_{\ell}}\log\mathopen{}\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}% \right)},\quad\forall a\in\mathcal{A}_{\ell},

which finishes the proof. ∎
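
To get a feel for the two terms in the bound just proved, the following minimal sketch evaluates its right-hand side numerically. The function names and all sample values are illustrative assumptions, not part of the algorithm; the point is only that the concentration term shrinks as $m_{\ell}$ grows while the bias term $2\varepsilon\sqrt{d_{\tilde{\nu}}}$ does not.

```python
import math


def log_term(n_actions: int, horizon: int, delta: float) -> float:
    """Logarithmic factor log(2 |A| log2(T) / delta) from the lemma."""
    return math.log(2 * n_actions * math.log2(horizon) / delta)


def confidence_width(eps: float, dim: float, m: float,
                     n_actions: int, horizon: int, delta: float) -> float:
    """Right-hand side of the bound: bias term plus concentration term."""
    bias = 2 * eps * math.sqrt(dim)  # does not decay with m
    noise = math.sqrt(4 * dim / m * log_term(n_actions, horizon, delta))
    return bias + noise
```

With $\varepsilon=0$ the width reduces to the concentration term alone, which is the regime used later for the instance-dependent bound.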

Since the marginal distributions $\tilde{\nu}_{a}$ may be inaccurate, we may not be able to show that the optimal action $a^{*}$ is never eliminated with high probability. What we can show instead is that every action retained at the end of phase $\ell$ is near-optimal relative to the best action in $\mathcal{A}_{\ell}$ . To be concrete, define $a^{*}_{\ell}\in\operatorname*{arg\!\min}_{a\in\mathcal{A}_{\ell}}\Delta_{a}$ to be the true optimal action within $\mathcal{A}_{\ell}$ . Then we can show that $\Delta_{a}-\Delta_{a^{*}_{\ell}}$ is small for any $a$ that is not eliminated at the end of phase $\ell$ .

Lemma B.2.

Conditioned on the event $E^{\mathrm{phase}}_{\ell}(\delta)$ , any action $a$ not eliminated at the end of phase $\ell$ has relative sub-optimality gap $\langle\mu^{\mathcal{Z}},\nu_{a^{*}_{\ell}}-\nu_{a}\rangle=\Delta_{a}-\Delta_{a^{*}_{\ell}}\leq 2\varepsilon(1+2\sqrt{d_{\tilde{\nu}}})+4\sqrt{\frac{4d_{\tilde{\nu}}}{m_{\ell}}\log\mathopen{}\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}$ .

Proof of Lemma B.2.

By the update rule for the active set, whenever $a\in\mathcal{A}_{\ell}$ is not eliminated at the end of phase $\ell$ , it holds that

\displaystyle\langle\hat{\mu}^{\mathcal{Z}}_{\ell},\tilde{\nu}_{a^{*}_{\ell}}-% \tilde{\nu}_{a}\rangle\leq\max_{b\in\mathcal{A}_{\ell}}\langle\hat{\mu}^{% \mathcal{Z}}_{\ell},\tilde{\nu}_{b}-\tilde{\nu}_{a}\rangle\leq 2\sqrt{\frac{4d% _{\tilde{\nu}}}{m_{\ell}}\log\mathopen{}\left(\frac{2|\mathcal{A}|\log_{2}(T)}% {\delta}\right)}.

It implies that

	$\displaystyle\langle\mu^{\mathcal{Z}},\tilde{\nu}_{a^{*}_{\ell}}-\tilde{\nu}_{a}\rangle$	$\displaystyle=\langle\mu^{\mathcal{Z}}-\hat{\mu}^{\mathcal{Z}}_{\ell},\tilde{\nu}_{a^{*}_{\ell}}-\tilde{\nu}_{a}\rangle+\langle\hat{\mu}^{\mathcal{Z}}_{\ell},\tilde{\nu}_{a^{*}_{\ell}}-\tilde{\nu}_{a}\rangle$
		$\displaystyle\leq 2\mathopen{}\Big{(}\sqrt{\frac{4d_{\tilde{\nu}}}{m_{\ell}}\log\mathopen{}\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}+2\varepsilon\sqrt{d_{\tilde{\nu}}}\Big{)}+2\sqrt{\frac{4d_{\tilde{\nu}}}{m_{\ell}}\log\mathopen{}\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}$
		$\displaystyle=4\sqrt{\frac{4d_{\tilde{\nu}}}{m_{\ell}}\log\mathopen{}\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}+4\varepsilon\sqrt{d_{\tilde{\nu}}},$

where we use the fact that we are conditioning on $E^{\mathrm{phase}}_{\ell}(\delta)$ in the inequality. Hence under the true marginals $\nu$ ,

\displaystyle\langle\mu^{\mathcal{Z}},\nu_{a^{*}_{\ell}}-\nu_{a}\rangle\leq 4% \sqrt{\frac{4d_{\tilde{\nu}}}{m_{\ell}}\log\mathopen{}\left(\frac{2|\mathcal{A% }|\log_{2}(T)}{\delta}\right)}+4\varepsilon\sqrt{d_{\tilde{\nu}}}+2\varepsilon.

∎

Now we need to track $\Delta_{a^{*}_{\ell}}$ , the sub-optimality of the best active action in each phase. Observe that $\Delta_{a^{*}_{\ell}}=\sum_{k=1}^{\ell-1}(\Delta_{a^{*}_{k+1}}-\Delta_{a^{*}_{% k}})$ since $\Delta_{a^{*}_{1}}=\Delta_{a^{*}}=0$ . Then it suffices to control each $\Delta_{a^{*}_{k+1}}-\Delta_{a^{*}_{k}},k\leq\ell-1$ , to control the growth of $\Delta_{a^{*}_{\ell}}$ .

Lemma B.3.

Conditioned on the event $E^{\mathrm{phase}}_{\ell}(\delta)$ , we have $\Delta_{a^{*}_{\ell+1}}-\Delta_{a^{*}_{\ell}}\leq 2\varepsilon(1+2\sqrt{d_{\tilde{\nu}}})$ .

Proof of Lemma B.3.

Suppose $E^{\mathrm{phase}}_{\ell}(\delta)$ happens. Notice that the result holds trivially if $a^{*}_{\ell}$ is not eliminated at the end of phase $\ell$ , because in this case $a^{*}_{\ell+1}=a^{*}_{\ell}$ . On the other hand, if $a^{*}_{\ell}$ is eliminated, define $\hat{a}_{\ell}\in\operatorname*{arg\!\max}_{a\in\mathcal{A}_{\ell}}\langle\hat{\mu}^{\mathcal{Z}}_{\ell},\tilde{\nu}_{a}\rangle$ to be the empirically best action at the end of phase $\ell$ . Then we have

\displaystyle\langle\hat{\mu}^{\mathcal{Z}}_{\ell},\tilde{\nu}_{\hat{a}_{\ell}% }-\tilde{\nu}_{a^{*}_{\ell}}\rangle>2\sqrt{\frac{4d_{\tilde{\nu}}}{m_{\ell}}% \log\mathopen{}\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)},

according to the elimination test. Meanwhile, recall that by the in-phase concentration and the $\varepsilon$-closeness of $\tilde{\nu}$ and $\nu$ ,

	$\displaystyle\langle\hat{\mu}^{\mathcal{Z}}_{\ell},\tilde{\nu}_{\hat{a}_{\ell}}-\tilde{\nu}_{a^{*}_{\ell}}\rangle$	$\displaystyle\leq\langle\mu^{\mathcal{Z}},\tilde{\nu}_{\hat{a}_{\ell}}-\tilde{\nu}_{a^{*}_{\ell}}\rangle+4\varepsilon\sqrt{d_{\tilde{\nu}}}+2\sqrt{\frac{4d_{\tilde{\nu}}}{m_{\ell}}\log\mathopen{}\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}$
		$\displaystyle\leq\langle\mu^{\mathcal{Z}},\nu_{\hat{a}_{\ell}}-\nu_{a^{*}_{\ell}}\rangle+2\varepsilon+4\varepsilon\sqrt{d_{\tilde{\nu}}}+2\sqrt{\frac{4d_{\tilde{\nu}}}{m_{\ell}}\log\mathopen{}\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}.$

Hence we get

\displaystyle\Delta_{\hat{a}_{\ell}}-\Delta_{a^{*}_{\ell}}=\langle\mu^{% \mathcal{Z}},\nu_{a^{*}_{\ell}}-\nu_{\hat{a}_{\ell}}\rangle\leq 2\varepsilon+4% \varepsilon\sqrt{d_{\tilde{\nu}}}

and

\displaystyle\Delta_{a^{*}_{\ell+1}}-\Delta_{a^{*}_{\ell}}\leq\Delta_{\hat{a}_% {\ell}}-\Delta_{a^{*}_{\ell}}\leq 2\varepsilon+4\varepsilon\sqrt{d_{\tilde{\nu% }}}.

∎

Corollary B.4.

For any $\ell\geq 2$ and conditioning on $\bigcap_{k\leq\ell-1}E^{\mathrm{phase}}_{k}(\delta)$ , we have that $\Delta_{a^{*}_{\ell}}\leq 2\varepsilon(\ell-1)(1+2\sqrt{d_{\tilde{\nu}}})$ and $\Delta_{a}\leq 2\varepsilon(1+2\sqrt{d_{\tilde{\nu}}})+4\sqrt{\frac{4d_{\tilde% {\nu}}}{m_{\ell-1}}\log\mathopen{}\left(\frac{2|\mathcal{A}|\log_{2}(T)}{% \delta}\right)}+2\varepsilon(\ell-2)(1+2\sqrt{d_{\tilde{\nu}}})$ for all $a\in\mathcal{A}_{\ell}$ .

Proof of Corollary B.4.

By conditioning on the intersection of all $E^{\mathrm{phase}}_{k}(\delta),k\leq\ell-1$ , we have that

\displaystyle\Delta_{a^{*}_{k+1}}-\Delta_{a^{*}_{k}}\leq 2\varepsilon(1+2\sqrt% {d_{\tilde{\nu}}}),\forall k\leq\ell-1,

which implies that $\Delta_{a^{*}_{s}}=\sum_{k=1}^{s-1}(\Delta_{a^{*}_{k+1}}-\Delta_{a^{*}_{k}})% \leq 2\varepsilon(s-1)(1+2\sqrt{d_{\tilde{\nu}}}),\forall s\leq\ell$ . In particular, there is

\displaystyle\Delta_{a^{*}_{\ell}}\leq 2\varepsilon(\ell-1)(1+2\sqrt{d_{\tilde% {\nu}}}).

Since every action $a\in\mathcal{A}_{\ell}$ passes the test at the end of the $(\ell-1)$-th phase and hence is not eliminated, by Lemma B.2 we know

\displaystyle\Delta_{a}-\Delta_{a^{*}_{\ell-1}}\leq 2\varepsilon(1+2\sqrt{d_{% \tilde{\nu}}})+4\sqrt{\frac{4d_{\tilde{\nu}}}{m_{\ell-1}}\log\mathopen{}\left(% \frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}.

Therefore, for all $a\in\mathcal{A}_{\ell}$ ,

\displaystyle\Delta_{a}=\Delta_{a}-\Delta_{a^{*}_{\ell-1}}+\Delta_{a^{*}_{\ell% -1}}\leq 2\varepsilon(1+2\sqrt{d_{\tilde{\nu}}})+4\sqrt{\frac{4d_{\tilde{\nu}}% }{m_{\ell-1}}\log\mathopen{}\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}% \right)}+2\varepsilon(\ell-2)(1+2\sqrt{d_{\tilde{\nu}}}).

∎
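
As a small worked illustration of Corollary B.4, the sketch below evaluates its two bounds: the drift of the best active action, which grows linearly in the phase index, and the gap bound for an arbitrary active action, obtained by adding Lemma B.2 at phase $\ell-1$ to the drift of $a^{*}_{\ell-1}$. The function names are hypothetical and `log_term` stands for the factor $\log(2|\mathcal{A}|\log_{2}(T)/\delta)$, passed in as a number.

```python
import math


def drift_bound(eps: float, dim: float, ell: int) -> float:
    """Bound on Delta_{a*_ell}: 2 * eps * (ell - 1) * (1 + 2 * sqrt(d))."""
    return 2 * eps * (ell - 1) * (1 + 2 * math.sqrt(dim))


def active_gap_bound(eps: float, dim: float, ell: int,
                     m_prev: float, log_term: float) -> float:
    """Bound on Delta_a for a in A_ell (ell >= 2).

    Lemma B.2 applied at phase ell - 1, plus the drift of a*_{ell-1}.
    m_prev plays the role of m_{ell-1}.
    """
    lemma_b2 = (2 * eps * (1 + 2 * math.sqrt(dim))
                + 4 * math.sqrt(4 * dim / m_prev * log_term))
    return lemma_b2 + drift_bound(eps, dim, ell - 1)
```

When `eps` is zero the drift vanishes, so the best active action is always the true optimum and only the concentration term remains.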

B.2 Proof of Theorem 5.5

Now we are prepared to prove Theorem 5.5.

Proof of Theorem 5.5.

Let $\ell_{\text{max}}(t)$ be the index of the phase where round $t$ is located. It’s easy to see that $\ell_{\text{max}}(T)\leq\log_{2}(T)$ . In the following we condition on the event $\bigcap_{\ell\leq\ell_{\text{max}}(T)}E^{\mathrm{phase}}_{\ell}(\delta)$ , which happens with probability at least $1-\delta$ due to Lemma B.1.

Notice that phase $\ell_{\text{max}}(t)$ is not necessarily completed by the end of round $t$ , but we can always upper bound $\mathrm{Reg}(t)$ by the regret incurred in the first $\ell_{\text{max}}(t)$ complete phases. That is,

\displaystyle\mathrm{Reg}(t)\leq\sum_{\ell=1}^{\ell_{\text{max}}(t)}\sum_{a\in% \mathcal{A}_{\ell}}T_{\ell}(a)\cdot\Delta_{a}.

Since we have controlled sub-optimality of all active actions in Corollary B.4, it holds with probability at least $1-\delta$ that

	$\displaystyle\mathrm{Reg}(t)$	$\displaystyle\leq\sum_{\ell=1}^{\ell_{\text{max}}(t)}\sum_{a\in\mathcal{A}_{% \ell}}T_{\ell}(a)\cdot\Delta_{a}$
		$\displaystyle\leq 2m_{1}+C\sum_{\ell=2}^{\ell_{\text{max}}(t)}m_{\ell}\mathopen{}\left(\sqrt{\frac{d_{\tilde{\nu}}}{m_{\ell-1}}\log\mathopen{}\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}+\varepsilon\ell\sqrt{d_{\tilde{\nu}}}\right)$
		$\displaystyle\leq 2m_{1}+C\sum_{\ell=2}^{\ell_{\text{max}}(t)}\sqrt{m_{\ell}\cdot d_{\tilde{\nu}}\cdot\log\mathopen{}\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}+C\varepsilon\sqrt{d_{\tilde{\nu}}}\sum_{\ell=2}^{\ell_{\text{max}}(t)}m_{\ell}\ell$
		$\displaystyle\leq 2m_{1}+C\sqrt{m_{\ell_{\text{max}}(t)}\cdot d_{\tilde{\nu}}\cdot\log\mathopen{}\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}+C\varepsilon\sqrt{d_{\tilde{\nu}}}m_{\ell_{\text{max}}(t)}\log_{2}(T)$
		$\displaystyle\leq C\mathopen{}\left(\sqrt{d_{\tilde{\nu}}t\log\mathopen{}\left(\frac{2|\mathcal{A}|\log T}{\delta}\right)}+\varepsilon t\sqrt{d_{\tilde{\nu}}}\log T\right),$

where $C>0$ is an absolute constant that can vary from line to line. Thus we have finished the proof. ∎
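
The step from the sum over phases to a single term in $m_{\ell_{\text{max}}(t)}$ relies on the phase lengths growing geometrically. The sketch below checks the two sums numerically under the assumption of a doubling schedule $m_{\ell}=2^{\ell-1}m_{1}$ (an assumption made here for illustration; the helper names are hypothetical). Under doubling, $\sum_{\ell}\sqrt{m_{\ell}}$ is at most $(1-1/\sqrt{2})^{-1}\sqrt{m_{\ell_{\text{max}}}}$ and $\sum_{\ell}\ell\, m_{\ell}$ is at most $2\ell_{\text{max}}m_{\ell_{\text{max}}}$.

```python
import math
from typing import List


def phase_lengths(m1: int, n_phases: int) -> List[int]:
    """Assumed doubling schedule: m_ell = 2**(ell - 1) * m1 for ell = 1..n_phases."""
    return [m1 * 2 ** (ell - 1) for ell in range(1, n_phases + 1)]


def sqrt_sum_ratio(m1: int, n_phases: int) -> float:
    """(sum over phases of sqrt(m_ell)) divided by sqrt(m of the last phase)."""
    ms = phase_lengths(m1, n_phases)
    return sum(math.sqrt(m) for m in ms) / math.sqrt(ms[-1])


def weighted_sum_ratio(m1: int, n_phases: int) -> float:
    """(sum over phases of ell * m_ell) divided by n_phases * m of the last phase."""
    ms = phase_lengths(m1, n_phases)
    total = sum(ell * m for ell, m in enumerate(ms, start=1))
    return total / (n_phases * ms[-1])
```

Both ratios stay bounded by absolute constants regardless of the number of phases, which is what lets the constant $C$ absorb the sums.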

B.3 Proof of Theorem 4.3

Now we go back to the setting where $\varepsilon=0$ . The only modification needed to work out Theorem 4.3 is an instance-dependent control over the number of phases for which sub-optimal arms are not entirely eliminated.

Proof of Theorem 4.3.

Again suppose $E^{\mathrm{phase}}_{\ell}(\delta)$ happens for all $\ell$ . From Corollary B.4 (with $\varepsilon=0$ ) we know that every suboptimal action $a$ can only be played in the first phase and in those phases $\ell\geq 2$ such that $\Delta_{a}\leq 4\sqrt{\frac{4d_{\nu}}{m_{\ell-1}}\log\mathopen{}\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}$ . Let

\displaystyle\ell_{a}=\max\mathopen{}\Big{\{}\ell\geq 2:\Delta_{a}\leq 4\sqrt{% \frac{4d_{\nu}}{m_{\ell-1}}\log\mathopen{}\left(\frac{2|\mathcal{A}|\log_{2}(T% )}{\delta}\right)}\Big{\}}

be the index of the last phase in which $a$ can be played. It is easy to see that

\displaystyle\ell_{a}=2+\mathopen{}\left\lfloor\log_{2}\mathopen{}\left(\frac{% 64d_{\nu}}{m_{1}\Delta_{a}^{2}}\log\mathopen{}\left(\frac{2|\mathcal{A}|\log_{% 2}(T)}{\delta}\right)\right)\right\rfloor.

Hence there are at most $\ell_{\text{max}}=2+\mathopen{}\left\lfloor\log_{2}\mathopen{}\left(\frac{64d_{\nu}}{m_{1}\Delta_{\min}^{2}}\log\mathopen{}\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)\right)\right\rfloor$ phases before all suboptimal actions are eliminated, and $\mathrm{Reg}(T)$ can be controlled more carefully:

	$\displaystyle\mathrm{Reg}(T)$	$\displaystyle\leq\sum_{\ell=1}^{\ell_{\text{max}}}\sum_{a\in\mathcal{A}_{\ell}% }T_{\ell}(a)\cdot\Delta_{a}$
		$\displaystyle\leq 2m_{1}+C\sqrt{m_{\ell_{\text{max}}}\cdot d_{\nu}\cdot\log\mathopen{}\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}$
		$\displaystyle=2m_{1}+C\sqrt{2^{\ell_{\text{max}}}\cdot m_{1}\cdot d_{\nu}\cdot\log\mathopen{}\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}$
		$\displaystyle\leq 2m_{1}+C\sqrt{\frac{d_{\nu}\log\mathopen{}\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}{m_{1}\Delta_{\min}^{2}}\cdot m_{1}\cdot d_{\nu}\log\mathopen{}\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}$
		$\displaystyle\leq C\cdot\frac{d_{\nu}\log\mathopen{}\left(|\mathcal{A}|\log T/\delta\right)}{\Delta_{\min}},$

where $C>0$ is an absolute constant that can vary from line to line. Again the above regret bound holds with probability at least $1-\delta$ so we are done. ∎
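
The closed form for $\ell_{a}$ used in the proof can be sanity-checked against its definition as a maximum. The sketch below does so under the assumption that phase lengths double, $m_{\ell}=2^{\ell-1}m_{1}$, which is the schedule the closed form implicitly uses; the function names are hypothetical and `log_term` stands for $\log(2|\mathcal{A}|\log_{2}(T)/\delta)$.

```python
import math
from typing import Optional


def ell_a_closed_form(gap: float, dim: float, m1: float, log_term: float) -> int:
    """2 + floor(log2(64 * d * log_term / (m1 * gap**2)))."""
    return 2 + math.floor(math.log2(64 * dim * log_term / (m1 * gap ** 2)))


def ell_a_brute_force(gap: float, dim: float, m1: float, log_term: float,
                      max_phase: int = 200) -> Optional[int]:
    """Largest ell >= 2 with gap <= 4 * sqrt(4 * d * log_term / m_{ell-1}).

    Assumes m_{ell-1} = 2**(ell - 2) * m1; returns None if no phase qualifies.
    """
    best = None
    for ell in range(2, max_phase + 1):
        m_prev = m1 * 2 ** (ell - 2)
        if gap <= 4 * math.sqrt(4 * dim * log_term / m_prev):
            best = ell
    return best
```

For example, with `gap=0.1`, `dim=4`, `m1=8` and `log_term=5.0`, the argument of the logarithm is 16000, so both functions give phase index 15.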

Appendix C Anytime Regret Bounds for UCB and C-UCB

In this section we verify Proposition 3.4 for $\operatorname{UCB}$ and $\operatorname{C-UCB}$ algorithms for completeness. Note that our anytime regret bound for $\operatorname{C-UCB}$ is new in the literature.

C.1 Preliminaries

For each $t\in[T],a\in\mathcal{A}$ and $z\in\mathcal{Z}$ , define $\mathbb{T}^{\mathcal{A}}_{t}(a)=1\vee\sum_{s=1}^{t}\mathbb{I}\{A_{s}=a\}$ to be the number of times action $a$ is chosen in the first $t$ rounds (truncated below at 1), and define $\mathbb{T}^{\mathcal{Z}}_{t}(z)=1\vee\sum_{s=1}^{t}\mathbb{I}\{Z_{s}(A_{s})=z\}$ to be the number of times context $z$ is observed in the first $t$ rounds (likewise truncated). Further define the mean reward estimates $\hat{\mu}^{\mathcal{A}}_{t}(a),\hat{\mu}^{\mathcal{Z}}_{t}(z)$ by

	$\displaystyle\hat{\mu}^{\mathcal{A}}_{t}(a)$	$\displaystyle=\frac{1}{\mathbb{T}^{\mathcal{A}}_{t}(a)}\sum_{s=1}^{t}Y_{s}(A_{% s})\mathbb{I}\{A_{s}=a\}$
	$\displaystyle\hat{\mu}^{\mathcal{Z}}_{t}(z)$	$\displaystyle=\frac{1}{\mathbb{T}^{\mathcal{Z}}_{t}(z)}\sum_{s=1}^{t}Y_{s}(A_{% s})\mathbb{I}\{Z_{s}(A_{s})=z\}$

Then we introduce the upper confidence bounds used by the UCB-type algorithms under consideration. Given any prescribed confidence parameter $\delta\in(0,1)$ , define $\mathrm{UCB}^{\mathcal{A}}_{t}(a)=\hat{\mu}^{\mathcal{A}}_{t}(a)+\sqrt{\frac{% \log(2|\mathcal{A}|T/\delta)}{2\mathbb{T}^{\mathcal{A}}_{t}(a)}},\mathrm{UCB}^% {\mathcal{Z}}_{t}(z)=\hat{\mu}^{\mathcal{Z}}_{t}(z)+\sqrt{\frac{\log(2|% \mathcal{Z}|T/\delta)}{2\mathbb{T}^{\mathcal{Z}}_{t}(z)}}$ and $\widetilde{\mathrm{UCB}}_{t}(a)=\sum_{z\in\mathcal{Z}}\mathrm{UCB}^{\mathcal{Z% }}_{t}(z)\mathbb{P}_{\nu_{a}}[Z=z]$ for each $t\in[T],a\in\mathcal{A}$ and $z\in\mathcal{Z}$ . Furthermore, we use $\operatorname{UCB}(\delta)$ and $\operatorname{C-UCB}(\delta)$ to denote the standard UCB algorithm and C-UCB algorithm (Lu et al. 2020) which run by playing actions $A^{\mathrm{UCB}}_{t}$ and $A^{\mathrm{C-UCB}}_{t}$ at each round $t$ respectively, according to:

	$\displaystyle A^{\mathrm{UCB}}_{t}$	$\displaystyle=\operatorname*{arg\!\max}_{a\in\mathcal{A}}\mathrm{UCB}^{% \mathcal{A}}_{t-1}(a)$
	$\displaystyle A^{\mathrm{C-UCB}}_{t}$	$\displaystyle=\operatorname*{arg\!\max}_{a\in\mathcal{A}}\widetilde{\mathrm{% UCB}}_{t-1}(a).$
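
The two selection rules above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: actions and contexts are indexed from 0, `marginals[a][z]` is assumed to hold the known probability $\mathbb{P}_{\nu_{a}}[Z=z]$, ties are broken by the smallest index, and all function and argument names are hypothetical.

```python
import math
from typing import Sequence


def bonus(count: int, size: int, horizon: int, delta: float) -> float:
    """Confidence radius sqrt(log(2 * size * T / delta) / (2 * max(1, count)))."""
    return math.sqrt(math.log(2 * size * horizon / delta) / (2 * max(1, count)))


def ucb_action(sums: Sequence[float], counts: Sequence[int],
               horizon: int, delta: float) -> int:
    """Action maximizing UCB^A; sums[a] is the total reward collected by a."""
    k = len(counts)
    scores = [sums[a] / max(1, counts[a]) + bonus(counts[a], k, horizon, delta)
              for a in range(k)]
    return max(range(k), key=scores.__getitem__)


def c_ucb_action(z_sums: Sequence[float], z_counts: Sequence[int],
                 marginals: Sequence[Sequence[float]],
                 horizon: int, delta: float) -> int:
    """Action maximizing sum_z UCB^Z(z) * P_{nu_a}[Z = z]."""
    n_z = len(z_counts)
    ucb_z = [z_sums[z] / max(1, z_counts[z]) + bonus(z_counts[z], n_z, horizon, delta)
             for z in range(n_z)]
    scores = [sum(p * u for p, u in zip(row, ucb_z)) for row in marginals]
    return max(range(len(marginals)), key=scores.__getitem__)
```

The only difference between the two rules is where the statistics live: `ucb_action` keeps one estimate per action, while `c_ucb_action` keeps one per context and mixes them through the marginals.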

Before analyzing the regret of $\operatorname{UCB}(\delta)$ and $\operatorname{C-UCB}(\delta)$ , let’s finally define some high-probability events on which we can control the regret. For any given confidence parameters $\delta,\delta^{\prime}$ , define

\displaystyle E^{\mathcal{A}}(\delta)=\bigg{\{}\forall t\in[T],a\in\mathcal{A}% ,|\hat{\mu}^{\mathcal{A}}_{t}(a)-\mu^{\mathcal{A}}(a)|\leq\sqrt{\frac{\log(2|% \mathcal{A}|T/\delta)}{2\mathbb{T}^{\mathcal{A}}_{t}(a)}}\bigg{\}},

and in conditionally benign environments we additionally define

	$\displaystyle E^{\mathcal{Z}}(\delta)$	$\displaystyle=\bigg{\{}\forall t\in[T],z\in\mathcal{Z},|\hat{\mu}^{\mathcal{Z}}_{t}(z)-\mu^{\mathcal{Z}}(z)|\leq\sqrt{\frac{\log(2|\mathcal{Z}|T/\delta)}{2\mathbb{T}^{\mathcal{Z}}_{t}(z)}}\bigg{\}},$
	$\displaystyle E^{\mathrm{MG}}(\delta^{\prime})$	$\displaystyle=\bigg{\{}\forall t\in[T],\sum_{s=1}^{t}\sum_{z\in\mathcal{Z}}% \frac{1}{\sqrt{\mathbb{T}^{\mathcal{Z}}_{s-1}(z)}}(\mathbb{P}_{\nu_{A_{s}}}[Z=% z]-\mathbb{I}\{Z_{s}=z\})\leq\sqrt{2t\log(T/\delta^{\prime})}\bigg{\}},$

where we recall that $\mu^{\mathcal{Z}}(z)=\mathbb{E}_{\nu_{a}}[Y|Z=z]$ is well-defined here. First we can see that $E^{\mathcal{A}}(\delta)$ and $E^{\mathcal{Z}}(\delta)$ happen with probability at least $1-\delta$ regardless of the underlying environment and the chosen policy:

Lemma C.1 (Lemma B.1 and B.2 in Bilodeau et al. 2022).

For any $\nu\in\mathcal{P}(\mathcal{Z}\times\mathcal{Y})^{\mathcal{A}}$ and $\pi\in\Pi(\mathcal{A},\mathcal{Z},T)$ ,

\displaystyle\mathbb{P}_{\nu,\pi}[(E^{\mathcal{A}}(\delta))^{c}]\leq\delta,

and for any $\nu\in\mathcal{P}(\mathcal{Z}\times\mathcal{Y})^{\mathcal{A}}$ that is conditionally benign and $\pi\in\Pi(\mathcal{A},\mathcal{Z},T)$ ,

\displaystyle\mathbb{P}_{\nu,\pi}[(E^{\mathcal{Z}}(\delta))^{c}]\leq\delta.

To get our new anytime regret bound for $\operatorname{C-UCB}(\delta)$ , we need to further condition on $E^{\mathrm{MG}}(\delta^{\prime})$ which happens with probability at least $1-\delta^{\prime}$ :

Lemma C.2.

For any $\nu\in\mathcal{P}(\mathcal{Z}\times\mathcal{Y})^{\mathcal{A}}$ and $\pi\in\Pi(\mathcal{A},\mathcal{Z},T)$ ,

\displaystyle\mathbb{P}_{\nu,\pi}[(E^{\mathrm{MG}}(\delta^{\prime}))^{c}]\leq% \delta^{\prime}

Proof of Lemma C.2.

Define

	$\displaystyle M_{t}$	$\displaystyle=\sum_{s=1}^{t}\sum_{z\in\mathcal{Z}}\frac{1}{\sqrt{\mathbb{T}^{% \mathcal{Z}}_{s-1}(z)}}(\mathbb{P}_{\nu_{A_{s}}}[Z=z]-\mathbb{I}\{Z_{s}=z\}),% \forall t\in[T],$
	$\displaystyle M_{0}$	$\displaystyle=0.$

Then $E^{\mathrm{MG}}(\delta^{\prime})=\bigg{\{}\forall t\in[T],M_{t}\leq\sqrt{2t\log(T/\delta^{\prime})}\bigg{\}}$ and it is easy to check that $\{M_{t}\}_{t\geq 0}$ is a martingale sequence with respect to $\mathcal{F}_{t}=\sigma(A_{t},H_{t-1})$ . To see this,

\displaystyle\mathbb{E}_{\nu,\pi}[M_{t}|A_{t},H_{t-1}]=M_{t-1}+\sum_{z\in% \mathcal{Z}}\frac{1}{\sqrt{\mathbb{T}^{\mathcal{Z}}_{t-1}(z)}}\mathbb{E}_{\nu,% \pi}[\mathbb{P}_{\nu_{A_{t}}}[Z=z]-\mathbb{I}\{Z_{t}=z\}|A_{t}]=M_{t-1}.

Also,

	$\displaystyle|M_{t}-M_{t-1}|$	$\displaystyle=\mathopen{}\left|\sum_{z\in\mathcal{Z}}\frac{1}{\sqrt{\mathbb{T}^{\mathcal{Z}}_{t-1}(z)}}\mathopen{}\left(\mathbb{P}_{\nu_{A_{t}}}[Z=z]-\mathbb{I}\{Z_{t}=z\}\right)\right|$
		$\displaystyle=\mathopen{}\left|\mathbb{E}_{\nu,\pi}\mathopen{}\Big{[}\sum_{z\in\mathcal{Z}}\frac{1}{\sqrt{\mathbb{T}^{\mathcal{Z}}_{t-1}(z)}}\mathbb{I}\{Z_{t}=z\}|A_{t},H_{t-1}\Big{]}-\sum_{z\in\mathcal{Z}}\frac{1}{\sqrt{\mathbb{T}^{\mathcal{Z}}_{t-1}(z)}}\mathbb{I}\{Z_{t}=z\}\right|$
		$\displaystyle=\mathopen{}\left|\mathbb{E}_{\nu,\pi}\mathopen{}\Big{[}\frac{1}{\sqrt{\mathbb{T}^{\mathcal{Z}}_{t-1}(Z_{t})}}|A_{t},H_{t-1}\Big{]}-\frac{1}{\sqrt{\mathbb{T}^{\mathcal{Z}}_{t-1}(Z_{t})}}\right|$
		$\displaystyle\leq 1.$

Then by Azuma-Hoeffding,

	$\displaystyle\mathbb{P}_{\nu,\pi}[M_{t}>\sqrt{2t\log(T/\delta^{\prime})}]$	$\displaystyle=\mathbb{P}_{\nu,\pi}[M_{t}-M_{0}>\sqrt{2t\log(T/\delta^{\prime})}]$
		$\displaystyle\leq\exp\mathopen{}\left(-\frac{2t\log(T/\delta^{\prime})}{2t}% \right)=\delta^{\prime}/T,$

and we get $\mathbb{P}_{\nu,\pi}[(E^{\mathrm{MG}}(\delta^{\prime}))^{c}]\leq\delta^{\prime}$ after taking a union bound over $t\in[T]$ . ∎
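
The arithmetic behind the choice of threshold can be checked directly. The sketch below (hypothetical helper names, illustrative values) evaluates the one-sided Azuma-Hoeffding tail $\exp(-x^{2}/(2tc^{2}))$ at $x=\sqrt{2t\log(T/\delta^{\prime})}$ with $c=1$, and computes a single martingale increment to show that it lies in $[-1,1]$.

```python
import math
from typing import Sequence


def azuma_tail(threshold: float, t: int, c: float = 1.0) -> float:
    """One-sided Azuma-Hoeffding bound exp(-x^2 / (2 t c^2)) for increments bounded by c."""
    return math.exp(-threshold ** 2 / (2 * t * c ** 2))


def mg_threshold(t: int, horizon: int, delta_prime: float) -> float:
    """Threshold sqrt(2 t log(T / delta')) appearing in the event E^MG(delta')."""
    return math.sqrt(2 * t * math.log(horizon / delta_prime))


def mg_increment(counts: Sequence[int], probs: Sequence[float], z_obs: int) -> float:
    """M_t - M_{t-1} = sum_z w_z * (P[Z = z] - 1{Z_t = z}), w_z = 1 / sqrt(max(1, count_z))."""
    weights = [1 / math.sqrt(max(1, n)) for n in counts]
    return sum(w * p for w, p in zip(weights, probs)) - weights[z_obs]
```

Each per-round tail equals $\delta^{\prime}/T$ whatever the value of $t$, so summing over the $T$ rounds gives the stated failure probability $\delta^{\prime}$.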

C.2 Anytime High-probability Regret Bound

Now we provide our high-probability regret bounds for $\operatorname{UCB}(\delta)$ and $\operatorname{C-UCB}(\delta)$ that will lead to Proposition 3.4.

Theorem C.3.

In any environment $\nu$ , the regret of $\operatorname{UCB}(\delta)$ is bounded by

\displaystyle\mathrm{Reg}(t)=O\mathopen{}\left(\sqrt{|\mathcal{A}|\log(|% \mathcal{A}|T/\delta)t}\right)

for all $t\in[T]$ , conditioning on event $E^{\mathcal{A}}(\delta)$ which happens with probability at least $1-\delta$ .

Proof of Theorem C.3.

On the event $E^{\mathcal{A}}(\delta)$ , we have that $\mu^{\mathcal{A}}(a)\leq\mathrm{UCB}^{\mathcal{A}}_{t}(a)\leq\mu^{\mathcal{A}}(a)+2\sqrt{\frac{\log(2|\mathcal{A}|T/\delta)}{2\mathbb{T}^{\mathcal{A}}_{t}(a)}}$ for all $a\in\mathcal{A},t\in[T]$ . Hence conditioned on $E^{\mathcal{A}}(\delta)$ , the regret of $\operatorname{UCB}(\delta)$ up to any round $t\in[T]$ satisfies

	$\displaystyle\mathrm{Reg}(t)$	$\displaystyle=\sum_{s=1}^{t}\mu^{\mathcal{A}}(a^{*})-\mu^{\mathcal{A}}(A_{s})$
		$\displaystyle=\sum_{s=1}^{t}(\mu^{\mathcal{A}}(a^{*})-\mathrm{UCB}^{\mathcal{A% }}_{s-1}(A_{s}))+(\mathrm{UCB}^{\mathcal{A}}_{s-1}(A_{s})-\mu^{\mathcal{A}}(A_% {s}))$
		$\displaystyle\leq\sum_{s=1}^{t}(\mathrm{UCB}^{\mathcal{A}}_{s-1}(a^{*})-% \mathrm{UCB}^{\mathcal{A}}_{s-1}(A_{s}))+(\mathrm{UCB}^{\mathcal{A}}_{s-1}(A_{% s})-\mu^{\mathcal{A}}(A_{s}))$
		$\displaystyle\leq\sum_{s=1}^{t}(\mathrm{UCB}^{\mathcal{A}}_{s-1}(A_{s})-\mu^{% \mathcal{A}}(A_{s}))$
		$\displaystyle\leq\sum_{s=1}^{t}\sqrt{\frac{2\log(2|\mathcal{A}|T/\delta)}{\mathbb{T}^{\mathcal{A}}_{s-1}(A_{s})}}$
		$\displaystyle=\sum_{s=1}^{t}\sum_{a\in\mathcal{A}}\sqrt{\frac{2\log(2|\mathcal{A}|T/\delta)}{\mathbb{T}^{\mathcal{A}}_{s-1}(A_{s})}}\mathbb{I}\{A_{s}=a\}$
		$\displaystyle\leq\sum_{a\in\mathcal{A}}\sqrt{8\log(2|\mathcal{A}|T/\delta)\mathbb{T}^{\mathcal{A}}_{t-1}(a)}$
		$\displaystyle\leq\sqrt{8\log(2|\mathcal{A}|T/\delta)|\mathcal{A}|t},$

where we use $A_{s}=A^{\mathrm{UCB}}_{s}$ throughout to simplify our notation. ∎
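
The last two inequalities in the display rest on two elementary facts: $\sum_{n=1}^{N}n^{-1/2}\leq 2\sqrt{N}$ , and the Cauchy-Schwarz bound $\sum_{a}\sqrt{N_{a}}\leq\sqrt{|\mathcal{A}|\sum_{a}N_{a}}$ . The following sketch checks both numerically; the helper names and the sample counts are illustrative assumptions.

```python
import math
from typing import Sequence


def inverse_sqrt_sum(n: int) -> float:
    """sum_{k=1}^{n} 1 / sqrt(k)."""
    return sum(1 / math.sqrt(k) for k in range(1, n + 1))


def sqrt_count_sum(counts: Sequence[int]) -> float:
    """sum_a sqrt(N_a) for a vector of per-action pull counts."""
    return sum(math.sqrt(c) for c in counts)


def cauchy_schwarz_cap(counts: Sequence[int]) -> float:
    """sqrt(|A| * sum_a N_a), the Cauchy-Schwarz upper bound on sqrt_count_sum."""
    return math.sqrt(len(counts) * sum(counts))
```

The cap is attained when all counts are equal, which is why the worst case for this step is a uniform allocation of pulls across actions.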

Theorem C.4.

In any conditionally benign environment $\nu$ , the regret of $\operatorname{C-UCB}(\delta)$ is bounded by

\displaystyle\mathrm{Reg}(t)=O\mathopen{}\left(\sqrt{\log(|\mathcal{Z}|T/% \delta)}\mathopen{}\left(\sqrt{|\mathcal{Z}|}+\sqrt{\log(T/\delta^{\prime})}% \right)\sqrt{t}\right)

for all $t\in[T]$ , conditioning on event $E^{\mathcal{Z}}(\delta)\cap E^{\mathrm{MG}}(\delta^{\prime})$ which happens with probability at least $1-\delta-\delta^{\prime}$ .

Proof of Theorem C.4.

Similarly in event $E^{\mathcal{Z}}(\delta)$ we have $\mu^{\mathcal{Z}}(z)\leq\mathrm{UCB}^{\mathcal{Z}}_{t}(z)\leq\mu^{\mathcal{Z}}% (z)+2\sqrt{\frac{\log(2|\mathcal{Z}|T/\delta)}{2\mathbb{T}^{\mathcal{Z}}_{t}(z% )}}$ for all $z\in\mathcal{Z},t\in[T]$ . Additionally,

	$\displaystyle\mu^{\mathcal{A}}(a^{*})$	$\displaystyle=\sum_{z\in\mathcal{Z}}\mu^{\mathcal{Z}}(z)\mathbb{P}_{\nu_{a^{*}% }}[Z=z]$
		$\displaystyle\leq\sum_{z\in\mathcal{Z}}\mathrm{UCB}^{\mathcal{Z}}_{t-1}(z)% \mathbb{P}_{\nu_{a^{*}}}[Z=z]$
		$\displaystyle=\widetilde{\mathrm{UCB}}_{t-1}(a^{*})\leq\widetilde{\mathrm{UCB}% }_{t-1}(A_{t}),\forall t\in[T],$

where $A_{t}=A^{\mathrm{C-UCB}}_{t}$ is the action played by $\operatorname{C-UCB}(\delta)$ . Therefore we can control the cumulative regret of $\operatorname{C-UCB}(\delta)$ in the first $t$ rounds as follows:

	$\displaystyle\mathrm{Reg}(t)$	$\displaystyle=\sum_{s=1}^{t}(\mu^{\mathcal{A}}(a^{*})-\widetilde{\mathrm{UCB}}% _{s-1}(A_{s}))+(\widetilde{\mathrm{UCB}}_{s-1}(A_{s})-\mu^{\mathcal{A}}(A_{s}))$
		$\displaystyle\leq\sum_{s=1}^{t}\widetilde{\mathrm{UCB}}_{s-1}(A_{s})-\mu^{% \mathcal{A}}(A_{s})$
		$\displaystyle=\sum_{s=1}^{t}\sum_{z\in\mathcal{Z}}(\mathrm{UCB}^{\mathcal{Z}}_% {s-1}(z)-\mu^{\mathcal{Z}}(z))\mathbb{P}_{\nu_{A_{s}}}[Z=z]$
		$\displaystyle\leq\sum_{s=1}^{t}\sum_{z\in\mathcal{Z}}\sqrt{\frac{2\log(2|\mathcal{Z}|T/\delta)}{\mathbb{T}^{\mathcal{Z}}_{s-1}(z)}}\mathbb{P}_{\nu_{A_{s}}}[Z=z]$
		$\displaystyle=\sum_{s=1}^{t}\sum_{z\in\mathcal{Z}}\sqrt{\frac{2\log(2|\mathcal{Z}|T/\delta)}{\mathbb{T}^{\mathcal{Z}}_{s-1}(z)}}\mathbb{I}\{Z_{s}=z\}+\sum_{s=1}^{t}\sum_{z\in\mathcal{Z}}\sqrt{\frac{2\log(2|\mathcal{Z}|T/\delta)}{\mathbb{T}^{\mathcal{Z}}_{s-1}(z)}}(\mathbb{P}_{\nu_{A_{s}}}[Z=z]-\mathbb{I}\{Z_{s}=z\})$
		$\displaystyle\leq\sqrt{8\log(2|\mathcal{Z}|T/\delta)|\mathcal{Z}|t}+\sum_{s=1}^{t}\sum_{z\in\mathcal{Z}}\sqrt{\frac{2\log(2|\mathcal{Z}|T/\delta)}{\mathbb{T}^{\mathcal{Z}}_{s-1}(z)}}(\mathbb{P}_{\nu_{A_{s}}}[Z=z]-\mathbb{I}\{Z_{s}=z\}),$

where in the last inequality we use the same argument as in the proof of Theorem C.3, and the remaining summation term can be controlled by $\sqrt{4\log(2|\mathcal{Z}|T/\delta)\log(T/\delta^{\prime})t}$ immediately after we further condition on $E^{\mathrm{MG}}(\delta^{\prime})$ . Therefore, we get

\displaystyle\mathrm{Reg}(t)\leq\sqrt{\log(2|\mathcal{Z}|T/\delta)}\mathopen{}% \left(\sqrt{8|\mathcal{Z}|}+\sqrt{4\log(T/\delta^{\prime})}\right)\sqrt{t},% \forall t\in[T],

in event $E^{\mathcal{Z}}(\delta)\cap E^{\mathrm{MG}}(\delta^{\prime})$ . ∎

Combining Theorem C.3 with Theorem C.4 and taking $\delta^{\prime}=\delta$ , we thus verify Proposition 3.4.

Appendix D Proofs of Lower Bounds

In this section we give the full proofs of Theorem 3.7 and Theorem 5.2. Note that our proof of Theorem 3.7 follows, and substantially generalizes, that of Bilodeau et al. (2022, Theorem 6.2).

D.1 Proof of Theorem 3.7

Proof of Theorem 3.7.

Fix $\mathcal{A},\mathcal{Z}$ and $T$ . Let $\mathcal{Z}_{0}$ be an arbitrary proper subset of $\mathcal{Z}$ and $\mathcal{Z}_{1}=\mathcal{Z}\setminus\mathcal{Z}_{0}$ . Fix $\Delta\in(0,1/40)$ to be chosen later. Define the family of marginals for all instances appearing in this proof

\displaystyle q_{a}[Z\in\mathcal{Z}_{0}]=\begin{cases}1/2+2\Delta&a=1\\ 1/2&a\neq 1,\end{cases}

where probability is evenly spread within $\mathcal{Z}_{0}$ and $\mathcal{Z}_{1}$ respectively. Then define a conditionally benign environment $\nu\in\mathcal{P}(\mathcal{Z}\times\mathcal{Y})^{\mathcal{A}}$ by

\displaystyle\mathbb{P}_{\nu_{a}}[Y=1]=\sum_{z\in\mathcal{Z}}p[Y=1|Z=z]q_{a}[Z% =z],\quad\forall a\in\mathcal{A},

where $p[Y|Z]$ is a Bernoulli conditional distribution such that

\displaystyle p[Y=1|Z=z]=\begin{cases}3/4&z\in\mathcal{Z}_{0}\\ 1/4&z\in\mathcal{Z}_{1}.\end{cases}

Now we define some non-benign instances. For every $a_{0}\neq 1$ , define $\nu^{a_{0}}$ by

\displaystyle\mathbb{P}_{\nu_{a}^{a_{0}}}[Y=1]=\sum_{z\in\mathcal{Z}}p_{a}^{a_% {0}}[Y=1|Z=z]q_{a}[Z=z],\quad\forall a\in\mathcal{A},

where $p_{a}^{a_{0}}[Y|Z]$ is a Bernoulli conditional distribution such that

\displaystyle p_{a}^{a_{0}}[Y=1|Z=z]=\begin{cases}3/4&a=1,z\in\mathcal{Z}_{0}% \\ 1/4&a=1,z\in\mathcal{Z}_{1}\\ 3/4+4\Delta&a=a_{0},z\in\mathcal{Z}_{0}\\ 1/4&a=a_{0},z\in\mathcal{Z}_{1}\\ 3/4&a\notin\{1,a_{0}\},z\in\mathcal{Z}_{0}\\ 1/4&a\notin\{1,a_{0}\},z\in\mathcal{Z}_{1}.\end{cases}

For any MAB algorithm $\mathfrak{a}$ , let $\pi^{q}=\mathfrak{a}(\mathcal{A},\mathcal{Z},q,T)$ be the actual policy implemented by $\mathfrak{a}$ when it’s interacting with $\nu$ and $\nu^{a_{0}}$ . Then by the divergence decomposition formula and Bretagnolle-Huber inequality,

	$\displaystyle\mathbb{E}_{\nu,\pi^{q}}[\mathrm{Reg}(T)]+\mathbb{E}_{\nu^{a_{0}}% ,\pi^{q}}[\mathrm{Reg}(T)]$	$\displaystyle\geq\frac{T\Delta}{2}\mathbb{P}_{\nu,\pi^{q}}[\mathbb{T}^{% \mathcal{A}}_{T}(1)\leq T/2]+\frac{T\Delta}{2}\mathbb{P}_{\nu^{a_{0}},\pi^{q}}% [\mathbb{T}^{\mathcal{A}}_{T}(1)>T/2]$
		$\displaystyle\geq\frac{T\Delta}{4}\exp(-\mathrm{KL}(\mathbb{P}_{\nu,\pi^{q}}\ \|\ \mathbb{P}_{\nu^{a_{0}},\pi^{q}}))$
		$\displaystyle=\frac{T\Delta}{4}\exp\mathopen{}\left(-\frac{1}{2}\mathbb{E}_{\nu,\pi^{q}}[\mathbb{T}^{\mathcal{A}}_{T}(a_{0})]\mathrm{KL}(\mathrm{Ber}(3/4)\ \|\ \mathrm{Ber}(3/4+4\Delta))\right)$
		$\displaystyle\geq\frac{T\Delta}{4}\exp\mathopen{}\left(-\mathbb{E}_{\nu,\pi^{q% }}[\mathbb{T}^{\mathcal{A}}_{T}(a_{0})]\cdot 32\Delta^{2}\right),$

where in the last step we use $\mathrm{KL}(\mathrm{Ber}(3/4)\ \|\ \mathrm{Ber}(3/4+4\Delta))\leq 64\Delta^{2}$ for $\Delta<1/40$ . Combined with the worst-case regret upper bound $\mathbb{E}_{\nu,\pi^{q}}[\mathrm{Reg}(T)]+\mathbb{E}_{\nu^{a_{0}},\pi^{q}}[% \mathrm{Reg}(T)]\leq 2R(T;\mathcal{A},\mathcal{Z})$ , it implies that

\displaystyle\mathbb{E}_{\nu,\pi^{q}}[\mathbb{T}^{\mathcal{A}}_{T}(a_{0})]\geq% \frac{1}{32\Delta^{2}}\log\mathopen{}\left(\frac{T\Delta}{8R(T;\mathcal{A},% \mathcal{Z})}\right),\forall a_{0}\neq 1.

Noting that $\mathbb{E}_{\nu,\pi^{q}}[\mathrm{Reg}(T)]=\sum_{a_{0}\neq 1}\Delta\mathbb{E}_{\nu,\pi^{q}}[\mathbb{T}^{\mathcal{A}}_{T}(a_{0})]$ , we have

\displaystyle\mathbb{E}_{\nu,\pi^{q}}[\mathrm{Reg}(T)]\geq\frac{|\mathcal{A}|-% 1}{32\Delta}\log\mathopen{}\left(\frac{T\Delta}{8R(T;\mathcal{A},\mathcal{Z})}% \right).

So there exist absolute constants $c=\log 2/1024,c^{\prime}=1/641$ such that whenever $R(T;\mathcal{A},\mathcal{Z})\leq c^{\prime}T$ , the choice of $\Delta=\frac{16R(T;\mathcal{A},\mathcal{Z})}{T}$ satisfies $\Delta<1/40$ and

\displaystyle\mathbb{E}_{\nu,\pi^{q}}[\mathrm{Reg}(T)]\geq c\cdot\frac{|% \mathcal{A}|T}{R(T;\mathcal{A},\mathcal{Z})},

which completes the proof. ∎
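
The numerical ingredients of this proof are easy to check. The sketch below (hypothetical function names, illustrative parameter values) computes the Bernoulli KL divergence used in the last step, the mean rewards of the benign instance $\nu$ (with action 1 stored at index 0), and the lower bound $\frac{|\mathcal{A}|-1}{32\Delta}\log\big(\frac{T\Delta}{8R}\big)$ evaluated at $\Delta=16R/T$ , where `worst_case_rate` stands for $R(T;\mathcal{A},\mathcal{Z})$ .

```python
import math
from typing import List


def kl_bernoulli(p: float, q: float) -> float:
    """KL(Ber(p) || Ber(q)) in nats, for p and q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def mean_rewards_benign(n_actions: int, gap: float) -> List[float]:
    """Mean reward of each action in nu; index 0 puts mass 1/2 + 2*gap on Z_0."""
    q0 = [0.5 + 2 * gap] + [0.5] * (n_actions - 1)
    return [0.75 * q + 0.25 * (1 - q) for q in q0]


def benign_regret_lower_bound(n_actions: int, horizon: float,
                              worst_case_rate: float) -> float:
    """(|A| - 1) / (32 * gap) * log(T * gap / (8 R)) at gap = 16 R / T."""
    gap = 16 * worst_case_rate / horizon
    return (n_actions - 1) / (32 * gap) * math.log(
        horizon * gap / (8 * worst_case_rate))
```

With this choice of $\Delta$ the logarithm equals $\log 2$ , so the bound is $(|\mathcal{A}|-1)T\log 2/(512R)$ , which is at least $\frac{\log 2}{1024}\cdot|\mathcal{A}|T/R$ whenever $|\mathcal{A}|\geq 2$ .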

D.2 Proof of Theorem 5.2

Proof of Theorem 5.2.

Fix $\mathcal{A},\mathcal{Z}$ and $T$ . Let $\mathcal{Z}_{0}$ be an arbitrary proper subset of $\mathcal{Z}$ and $\mathcal{Z}_{1}=\mathcal{Z}\setminus\mathcal{Z}_{0}$ . Fix $\Delta\in(0,\frac{1}{40})$ to be chosen later. For all conditionally benign instances $\nu$ in this proof, we consider $\mathbb{P}_{\nu_{a}}[Y|Z]$ to be the Bernoulli distribution given by

\displaystyle\mathbb{P}_{\nu_{a}}[Y=1|Z=z]=p[Y=1|Z=z]=\begin{cases}3/4&z\in% \mathcal{Z}_{0}\\ 1/4&z\in\mathcal{Z}_{1},\end{cases}

which implies that contexts from $\mathcal{Z}_{0}$ are more rewarding than those from $\mathcal{Z}_{1}$ .

Now define conditionally benign environments $\nu,\nu^{a_{0}}\in\mathcal{P}(\mathcal{Z}\times\mathcal{Y})^{\mathcal{A}},% \forall a_{0}\neq 1$ , through their marginals

	$\displaystyle\mathbb{P}_{\nu_{a}}[Y=1]$	$\displaystyle=\sum_{z\in\mathcal{Z}}p[Y=1|Z=z]q_{a}[Z=z],$
	$\displaystyle\mathbb{P}_{\nu^{a_{0}}_{a}}[Y=1]$	$\displaystyle=\sum_{z\in\mathcal{Z}}p[Y=1|Z=z]q^{a_{0}}_{a}[Z=z],\quad\forall a\in\mathcal{A},$

where

\displaystyle q_{a}[Z\in\mathcal{Z}_{0}]=\begin{cases}1/2+2\Delta&a=1\\ 1/2&a\neq 1\end{cases}\quad\text{and}\quad q^{a_{0}}_{a}[Z\in\mathcal{Z}_{0}]=% \begin{cases}1/2+2\Delta&a=1\\ 1/2+4\Delta&a=a_{0}\\ 1/2&a\neq 1,a_{0},\end{cases}

where probability is evenly spread within $\mathcal{Z}_{0}$ and $\mathcal{Z}_{1}$ . So clearly action 1 is the only optimal action in $\nu$ and action $a_{0}$ is the only optimal action in $\nu^{a_{0}}$ , with sub-optimality gap $\Delta_{\min}(\nu)=\Delta_{\min}(\nu^{a_{0}})=\Delta$ .

Fix an algorithm $\mathfrak{a}\in\mathbb{A}_{\mathrm{agnostic}}$ and let $\tilde{\pi}=\mathfrak{a}(\mathcal{A},\mathcal{Z},T,\cdot)$ be the actual policy implemented by $\mathfrak{a}$ . By the divergence decomposition formula (Bilodeau et al. 2022), we have that for every $a_{0}\neq 1$ ,

	$\displaystyle\mathrm{KL}(\mathbb{P}_{\nu,\tilde{\pi}}\ \|\ \mathbb{P}_{\nu^{a_{0}},\tilde{\pi}})$	$\displaystyle=\sum_{a\in\mathcal{A}}\mathbb{E}_{\nu,\tilde{\pi}}[\mathbb{T}^{\mathcal{A}}_{T}(a)]\mathrm{KL}(\mathbb{P}_{\nu_{a}}\ \|\ \mathbb{P}_{\nu^{a_{0}}_{a}})$
		$\displaystyle=\sum_{a\in\mathcal{A}}\mathbb{E}_{\nu,\tilde{\pi}}[\mathbb{T}^{\mathcal{A}}_{T}(a)]\mathrm{KL}(q_{a}\ \|\ q^{a_{0}}_{a})$
		$\displaystyle=\mathbb{E}_{\nu,\tilde{\pi}}[\mathbb{T}^{\mathcal{A}}_{T}(a_{0})]\mathrm{KL}(q_{a_{0}}\ \|\ q^{a_{0}}_{a_{0}})$
		$\displaystyle=\mathbb{E}_{\nu,\tilde{\pi}}[\mathbb{T}^{\mathcal{A}}_{T}(a_{0})]\mathrm{KL}(\mathrm{Ber}(1/2)\ \|\ \mathrm{Ber}(1/2+4\Delta)).$

By the Bretagnolle–Huber inequality,

	$\displaystyle\mathbb{E}_{\nu,\tilde{\pi}}[\mathrm{Reg}(T)]+\mathbb{E}_{\nu^{a_% {0}},\tilde{\pi}}[\mathrm{Reg}(T)]$	$\displaystyle\geq\frac{T\Delta}{2}\mathopen{}\left(\mathbb{P}_{\nu,\tilde{\pi}% }[\mathbb{T}^{\mathcal{A}}_{T}(1)\leq T/2]+\mathbb{P}_{\nu^{a_{0}},\tilde{\pi}% }[\mathbb{T}^{\mathcal{A}}_{T}(1)>T/2]\right)$
		$\displaystyle\geq\frac{T\Delta}{4}\exp(-\mathrm{KL}(\mathbb{P}_{\nu,\tilde{\pi}}\ \|\ \mathbb{P}_{\nu^{a_{0}},\tilde{\pi}}))$
		$\displaystyle=\frac{T\Delta}{4}\exp(-\mathbb{E}_{\nu,\tilde{\pi}}[\mathbb{T}^{\mathcal{A}}_{T}(a_{0})]\mathrm{KL}(\mathrm{Ber}(1/2)\ \|\ \mathrm{Ber}(1/2+4\Delta))).$

Now we pick $a_{0}\in\operatorname*{arg\!\min}_{a\neq 1}\mathbb{E}_{\nu,\tilde{\pi}}[% \mathbb{T}^{\mathcal{A}}_{T}(a)]$ which implies that $\mathbb{E}_{\nu,\tilde{\pi}}[\mathbb{T}^{\mathcal{A}}_{T}(a_{0})]\leq\frac{T}{% |\mathcal{A}|-1}$ . Also $\mathrm{KL}(\mathrm{Ber}(1/2)\ \|\ \mathrm{Ber}(1/2+4\Delta))\leq 4(4\Delta)^{% 2}=64\Delta^{2}$ for $\Delta<1/40$ . So

\displaystyle\mathbb{E}_{\nu,\tilde{\pi}}[\mathrm{Reg}(T)]+\mathbb{E}_{\nu^{a_% {0}},\tilde{\pi}}[\mathrm{Reg}(T)]\geq\frac{T\Delta}{4}\exp\mathopen{}\left(-% \frac{64T\Delta^{2}}{|\mathcal{A}|-1}\right).

Taking $\Delta=\frac{1}{40}\sqrt{\frac{|\mathcal{A}|-1}{T}}$ , we know that $\max\mathopen{}\left\{\mathbb{E}_{\nu,\tilde{\pi}}[\mathrm{Reg}(T)],\mathbb{E}% _{\nu^{a_{0}},\tilde{\pi}}[\mathrm{Reg}(T)]\right\}\geq\frac{1}{2}(\mathbb{E}_% {\nu,\tilde{\pi}}[\mathrm{Reg}(T)]+\mathbb{E}_{\nu^{a_{0}},\tilde{\pi}}[% \mathrm{Reg}(T)])\geq c\sqrt{|\mathcal{A}|T}$ for some absolute constant $c>0$ , which yields the claim. ∎
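
To make the final constant concrete, the sketch below evaluates the two-point bound $\frac{T\Delta}{4}\exp\big(-\frac{64T\Delta^{2}}{|\mathcal{A}|-1}\big)$ at $\Delta=\frac{1}{40}\sqrt{(|\mathcal{A}|-1)/T}$ and checks the KL estimate used above on a grid of gaps below $1/40$ . Function names and sample values are illustrative assumptions.

```python
import math


def kl_bernoulli(p: float, q: float) -> float:
    """KL(Ber(p) || Ber(q)) in nats, for p and q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def two_point_lower_bound(n_actions: int, horizon: float) -> float:
    """T * gap / 4 * exp(-64 * T * gap^2 / (|A| - 1)) at gap = sqrt((|A| - 1) / T) / 40."""
    gap = math.sqrt((n_actions - 1) / horizon) / 40
    exponent = -64 * horizon * gap ** 2 / (n_actions - 1)
    return horizon * gap / 4 * math.exp(exponent)
```

At this gap the exponent is $-64/1600=-0.04$ , so the sum of the two expected regrets is at least $e^{-0.04}\sqrt{(|\mathcal{A}|-1)T}/160$ , and the larger of the two is at least half of that.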

In the above proof, it is easy to see that $\nu(Z)$ and $\nu^{a_{0}}(Z)$ are $\varepsilon$-close, where $\varepsilon=c\cdot\sqrt{|\mathcal{A}|/T}$ for some absolute constant $c$ . So for any algorithm given the input $\tilde{\nu}(Z)=\nu(Z)$ and interacting with $\bar{\nu}\in\mathopen{}\left\{\nu,\nu^{a_{0}},a_{0}\neq 1\right\}$ , the marginals $\tilde{\nu}(Z)$ and $\bar{\nu}(Z)$ are always $\varepsilon$-close, yet the algorithm incurs $\Omega(\sqrt{|\mathcal{A}|T})$ regret in some instance from $\mathopen{}\left\{\nu,\nu^{a_{0}},a_{0}\neq 1\right\}$ .

Appendix E Instances where $\operatorname{PE}$ incurs linear regret

In this section we give an example for $\operatorname{PE}$ illustrating that even merely forcing linear regret on a causal bandit algorithm requires constructing non-benign instances carefully and re-coding the algorithm to ensure erratic behavior on those instances. In particular, for every $\Delta\in(0,1)$ we construct a non-benign environment $\nu$ such that $\Delta_{\min}(\nu)=\Delta$ while the re-coded $\operatorname{PE}$ never plays the optimal arm.

Proposition E.1.

Suppose we modify Algorithm 2 such that, in each phase, we always choose an exact G-optimal design whenever feasible. For any $\mathcal{A},\mathcal{Z}$ and $T$ with $|\mathcal{A}|>|\mathcal{Z}|\geq 3$ and any $\Delta\in(0,1)$, there exists a non-benign environment $\nu$ such that $\Delta_{\min}(\nu)=\Delta$, while Algorithm 2 never plays the optimal arm and hence incurs linear regret,

$$
\mathrm{Reg}(t)\geq\Delta_{\min}(\nu)\cdot t=\Delta\cdot t,\quad\forall t\in[T].
$$

Proof of Proposition E.1.

For any $\mathcal{A}$ and $\mathcal{Z}$ with $|\mathcal{A}|>|\mathcal{Z}|\geq 3$, index the contexts arbitrarily so that $\mathcal{Z}=\{z_{1},\dots,z_{|\mathcal{Z}|}\}$, pick $|\mathcal{Z}|+1$ arms from $\mathcal{A}$, and denote them by $a^{*},a_{1},\dots,a_{|\mathcal{Z}|}$. Construct marginals $\nu_{a}$ as follows:

$$
\begin{aligned}
\nu_{a_{i}}(Z)&=\delta_{\{z_{i}\}}=:e_{i},\quad i\in[|\mathcal{Z}|],\\
\nu_{a^{*}}(Z)&=\frac{1}{2}(\delta_{\{z_{1}\}}+\delta_{\{z_{2}\}})=\frac{1}{2}(e_{1}+e_{2}),
\end{aligned}
$$

where we write marginal distributions over $\mathcal{Z}$ as vectors in $\mathbb{R}^{|\mathcal{Z}|}$ according to context indices. Then define conditional distributions $\mathbb{P}_{\nu_{a}}(Y|Z)$ :

$$
\mathbb{P}_{\nu_{a_{i}}}[Y|Z=z_{i}]=\begin{cases}\delta_{\{0\}}&i\in[|\mathcal{Z}|-1]\\ \delta_{\{1-\Delta\}}&i=|\mathcal{Z}|\end{cases}
$$

and

$$
\mathbb{P}_{\nu_{a^{*}}}[Y|Z=z_{1}]=\mathbb{P}_{\nu_{a^{*}}}[Y|Z=z_{2}]=\delta_{\{1\}}.
$$

In other words, playing arm $a_{i}$ yields context $z_{i}$ and a deterministic reward, while playing arm $a^{*}$ yields $z_{1}$ or $z_{2}$ with equal probability and always the optimal reward $1$. So the only optimal arm for $\nu$ is $a^{*}$, with $\Delta_{\min}(\nu)=\Delta$. We can treat all other $a\in\mathcal{A}$ as dummy actions by identifying each of them arbitrarily with one of $a_{i},i\in[|\mathcal{Z}|]$.
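To make the construction concrete, here is a minimal sketch that builds the instance as arrays. The helper `build_instance`, the row ordering (arms $a_{1},\dots,a_{|\mathcal{Z}|}$ first, $a^{*}$ last), and the values $|\mathcal{Z}|=4$, $\Delta=0.3$ are assumptions made for illustration. It computes $\Delta_{\min}$ and checks that no single vector of per-context mean rewards reproduces all arm means, which is one way to see that the environment is non-benign.

```python
import numpy as np


def build_instance(num_contexts: int, delta: float):
    """Return (marginals, true_means); rows are a_1, ..., a_|Z|, then a*."""
    n = num_contexts
    marginals = np.vstack([np.eye(n), np.zeros(n)])
    marginals[n, 0] = marginals[n, 1] = 0.5  # a* mixes z_1 and z_2 equally
    true_means = np.zeros(n + 1)             # a_1, ..., a_{|Z|-1} earn 0
    true_means[n - 1] = 1.0 - delta          # a_|Z| earns 1 - Delta
    true_means[n] = 1.0                      # a* always earns 1
    return marginals, true_means


marginals, true_means = build_instance(num_contexts=4, delta=0.3)

best = int(np.argmax(true_means))
gap_min = float(np.min(true_means[best] - np.delete(true_means, best)))

# If the environment were conditionally benign, some vector mu would satisfy
# marginals @ mu == true_means. The least-squares residual shows none does.
mu_fit, *_ = np.linalg.lstsq(marginals, true_means, rcond=None)
residual_norm = float(np.linalg.norm(marginals @ mu_fit - true_means))
print(best, gap_min, residual_norm)
```

The residual is nonzero because context $z_{1}$ carries reward $0$ under $a_{1}$ but reward $1$ under $a^{*}$, so the reward depends on the action even after conditioning on the context.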

Next we verify the following two facts. (1) When no action has been eliminated and $\mathcal{A}_{\ell}=\mathcal{A}$, no exact G-optimal design $\pi_{\ell}\in\mathcal{P}(\mathcal{A}_{\ell})$ puts positive mass on $a^{*}$. (2) Whenever any action is eliminated at the end of phase $\ell$, all actions except $a_{|\mathcal{Z}|}$ are eliminated as well, after which PE plays $a_{|\mathcal{Z}|}$ until the end. Combining these two facts, we conclude that PE never picks $a^{*}$ during the interaction with $\nu$.

No G-optimal design is supported on $a^{*}$ .

Recall that any G-optimal design $\pi_{\ell}$ maximizes $f(\pi)=\log\det V(\pi)$ , where $V(\pi)=\sum_{a\in\mathcal{A}_{\ell}}\pi(a)\nu_{a}\nu_{a}^{\top}$ over $\pi\in\mathcal{P}(\mathcal{A}_{\ell})$ (Lattimore & Szepesvári, 2020, Theorem 21.1). When $\mathcal{A}_{\ell}=\mathcal{A}$ , $\det V(\pi)$ can be computed as

$$
\det V(\pi)=\left(\pi(a_{1})\pi(a_{2})+\frac{\pi(a^{*})}{4}(\pi(a_{1})+\pi(a_{2}))\right)\pi(a_{3})\cdots\pi(a_{|\mathcal{Z}|}).
$$

By symmetry, any maximizing $\pi$ has $\pi(a_{1})=\pi(a_{2})$, and it then follows that any maximizing $\pi$ must have $\pi(a^{*})=0$. Moreover, there is only one G-optimal design in this case, namely $\pi_{\ell}=\mathrm{Unif}(a_{1},\dots,a_{|\mathcal{Z}|})$.
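The determinant formula and the optimality of the uniform design can be sanity-checked numerically. The sketch below is illustrative only: the helpers `design_matrix` and `det_closed_form`, the choice $|\mathcal{Z}|=4$, and the random search over Dirichlet-sampled designs are assumptions, and a random search is evidence rather than a proof.

```python
import numpy as np


def design_matrix(weights: np.ndarray, marginals: np.ndarray) -> np.ndarray:
    """V(pi) = sum_a pi(a) nu_a nu_a^T, with the rows of `marginals` as nu_a."""
    return marginals.T @ (weights[:, None] * marginals)


def det_closed_form(weights: np.ndarray) -> float:
    """Closed-form det V(pi); weights = (pi(a_1), ..., pi(a_|Z|), pi(a*))."""
    p, q = weights[:-1], weights[-1]
    return float((p[0] * p[1] + q / 4 * (p[0] + p[1])) * np.prod(p[2:]))


n = 4  # number of contexts |Z| (assumed for illustration)
marginals = np.vstack([np.eye(n), [0.5, 0.5] + [0.0] * (n - 2)])

# Uniform design over a_1, ..., a_|Z| with no mass on a*.
uniform = np.append(np.full(n, 1 / n), 0.0)
det_uniform = float(np.linalg.det(design_matrix(uniform, marginals)))

rng = np.random.default_rng(0)
max_formula_error = 0.0
best_random_det = 0.0
for _ in range(2000):
    w = rng.dirichlet(np.ones(n + 1))  # random design over all |Z| + 1 arms
    d = float(np.linalg.det(design_matrix(w, marginals)))
    max_formula_error = max(max_formula_error, abs(d - det_closed_form(w)))
    best_random_det = max(best_random_det, d)

print(det_uniform, best_random_det, max_formula_error)
```

Here the uniform design attains $\det V(\pi)=(1/|\mathcal{Z}|)^{|\mathcal{Z}|}$, and none of the sampled designs, all of which place positive mass on $a^{*}$, exceeds it.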

All actions other than $a_{|\mathcal{Z}|}$ would be eliminated at the same time.

If the first elimination happens at the end of phase $\ell$, then we must have $\hat{\mu}^{\mathcal{Z}}_{\ell}=(0,\dots,0,1-\Delta)^{\top}$, because $\pi_{\ell}=\mathrm{Unif}(a_{1},\dots,a_{|\mathcal{Z}|})$ and rewards are deterministic. So $\max_{b\in\mathcal{A}_{\ell}}\langle\hat{\mu}^{\mathcal{Z}}_{\ell},\nu_{b}-\nu_{a}\rangle$ is $1-\Delta$ for all $a\neq a_{|\mathcal{Z}|}$ and $0$ for $a=a_{|\mathcal{Z}|}$. The elimination must therefore happen among $a^{*},a_{1},\dots,a_{|\mathcal{Z}|-1}$, and since these actions all have the same estimated gap, every one of them is eliminated simultaneously. ∎
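The estimate and the estimated gaps in this step can be reproduced with a short computation. This is a sketch under stated assumptions: it uses the ordinary least-squares estimator $\hat{\mu}=V^{-1}\sum_{t}\nu_{A_{t}}Y_{t}$ that is standard for phased elimination in linear bandits, the helper name `estimated_gaps` and the parameter `pulls_per_arm` are hypothetical, and the elimination threshold of Algorithm 2 is deliberately not modeled.

```python
import numpy as np


def estimated_gaps(num_contexts: int, delta: float, pulls_per_arm: int):
    """Least-squares estimate after a uniform phase, and per-arm estimated gaps.

    Arms are ordered a_1, ..., a_|Z|, then a* (last entry).
    """
    n = num_contexts
    marginals = np.vstack([np.eye(n), [0.5, 0.5] + [0.0] * (n - 2)])
    played = marginals[:n]  # the uniform design only plays a_1, ..., a_|Z|

    # Deterministic rewards: 0 for a_1, ..., a_{|Z|-1} and 1 - Delta for a_|Z|.
    rewards = np.zeros(n)
    rewards[n - 1] = 1.0 - delta

    V = pulls_per_arm * played.T @ played
    b = pulls_per_arm * played.T @ rewards
    mu_hat = np.linalg.solve(V, b)

    scores = marginals @ mu_hat  # <mu_hat, nu_a> for every arm, including a*
    return mu_hat, scores.max() - scores


mu_hat, gaps = estimated_gaps(num_contexts=4, delta=0.3, pulls_per_arm=10)
print(mu_hat)  # estimate of the per-context mean reward
print(gaps)    # estimated gap of a_1, ..., a_|Z|, a*
```

All arms other than $a_{|\mathcal{Z}|}$, including $a^{*}$, share the estimated gap $1-\Delta$, so any threshold that removes one of them removes all of them at once.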
