Causal Bandits: The Pareto Optimal Frontier of Adaptivity, a Reduction to Linear Bandits, and Limitations around Unknown Marginals

Ziyi Liu, Department of Statistical Sciences, University of Toronto and Vector Institute. Idan Attias, Department of Computer Science, Ben-Gurion University and Vector Institute. Daniel M. Roy, Department of Statistical Sciences, University of Toronto and Vector Institute.
Abstract

In this work, we investigate the problem of adapting to the presence or absence of causal structure in multi-armed bandit problems. In addition to the usual reward signal, we assume the learner has access to additional variables, observed in each round after acting. When these variables $d$-separate the action from the reward, existing work in causal bandits demonstrates that one can achieve strictly better (minimax) rates of regret (Lu et al., 2020). Our goal is to adapt to this favorable “conditionally benign” structure, if it is present in the environment, while simultaneously recovering worst-case minimax regret, if it is not. Notably, the learner has no prior knowledge of whether the favorable structure holds. In this paper, we establish the Pareto optimal frontier of adaptive rates. We prove upper and matching lower bounds on the possible trade-offs in the performance of learning in conditionally benign and arbitrary environments, resolving an open question raised by Bilodeau et al. (2022). Furthermore, we are the first to obtain instance-dependent bounds for causal bandits, by reducing the problem to the linear bandit setting. Finally, we examine the common assumption that the marginal distributions of the post-action contexts are known and show that a nontrivial estimate is necessary for better-than-worst-case minimax rates.

$\star$ Equal contribution.

1 Introduction

In real-world decision making, we often want strong worst-case guarantees as well as the ability to adapt to favorable properties of real-world scenarios. Adaptive sequential decision-making offers a framework to design algorithms to achieve these objectives.

In this paper, we explore adaptivity in multi-armed bandit problems. In standard multi-armed bandits, the learner (policy) takes an action, receives a reward, and then this process repeats over a number of rounds. The learner’s regret is the difference between its cumulative reward and the cumulative reward of the single best action in hindsight. Can we work to identify high-reward actions while minimizing regret?

In this work, we assume there is post-action context, i.e., there may be additional information available to the learner after taking an action, beyond the reward signal. In a worst-case analysis, however, the learner can ignore the post-action context and still achieve minimax rates of regret: the worst-case environment will not offer useful information. However, many real-world settings possess the structure of multi-armed bandit problems with post-action context and, in those cases, this additional information is useful towards minimizing regret.

One way that post-action context can be useful is if we can assume causal structure relating the action (i.e., an intervention) to the reward and post-action (post-intervention) context. Several authors have studied models in this vein (Bareinboim et al., 2015; Lattimore et al., 2016). In this work, we build on the framework of Lattimore et al. (2016), wherein the post-action context is assumed to $d$-separate each intervention from its associated reward.

Under $d$-separation, the intervention and reward are independent, conditional on the post-intervention context. Bilodeau et al. (2022) formalized this structure in general terms: a bandit environment is conditionally benign whenever the conditional distribution of the reward, given the post-action context, does not depend on the action.

Minimax regret is well understood for both the classical and causal variants of multi-armed bandits. Notably, algorithms tailored to conditionally benign environments can achieve lower rates of regret, scaling with the number of post-action contexts, rather than the potentially much larger set of actions (Lu et al., 2020; Bilodeau et al., 2022).

Exploiting causal structure is not without its pitfalls. Bilodeau et al. (2022) showed that C-UCB, a minimax optimal causal bandit algorithm, suffers linear regret in some non-benign environments. This raised a natural question: Can we achieve strict adaptivity, i.e., obtain minimax rates simultaneously in the class of conditionally benign environments and in the class of all environments, without knowing in advance which class of environments we will face?

Bilodeau et al. proved that strict adaptivity is impossible, but showed that some level of adaptivity is possible. They designed a new algorithm, termed HAC-UCB, and proved that it achieves minimax optimal rates on the class of benign environments while always achieving (suboptimal, though sublinear) $T^{3/4}$ rates. In light of this result, Bilodeau et al. raised an open problem, asking whether HAC-UCB was, in a sense, Pareto optimal, implying that the slower rate was the price of adaptivity. More generally, we ask:

What is the Pareto optimal frontier of simultaneously achievable rates of regret in the classes of benign and arbitrary environments, and what algorithms achieve these optimal tradeoffs?

In this paper, we address the above question by providing a complete characterization of the Pareto optimal frontier (up to log factors) as well as the algorithms that achieve it. Besides adaptation, we also study the complexity of causal bandit problems from other perspectives. More specifically, we find a novel reduction from causal bandits to linear bandits, which facilitates the first instance-dependent regret bound for causal bandits and enables the application of some linear bandit algorithms to causal bandits. We also investigate dropping the common assumption that we have perfect knowledge of “the marginals”, i.e., the distribution of the post-action context variable, under each action. On the one hand, we show that it is impossible for any algorithm to enjoy improved minimax regret in benign environments without any knowledge of the true marginals. On the other hand, we identify cases where approximate knowledge of the marginal distributions suffices. Our contributions are explained in more detail as follows.

  • In Section 3, we establish near-optimal Pareto regret frontiers for the setting of causal bandits, resolving an open problem raised by Bilodeau et al. (2022), see Figure 1. Utilizing a dynamic balancing method introduced by Cutkosky et al. (2021), we derive the upper bound and also prove near-optimal matching lower bounds. Remarkably, we introduce a phenomenon we call the price of adaptivity, to capture the extra regret that one must incur when attempting to adapt to the presence or lack of causal structure. Consequently, we demonstrate that the model selection method introduced by Cutkosky et al. (2021) cannot be generally improved, for any nontrivial general improvement would decrease the price of adaptivity beyond our lower bound.

  • In Section 4, we present a novel reduction from causal bandits to linear bandits with conditional sub-Gaussian noise. Utilizing a phased elimination technique (Lattimore et al., 2020), we identify a new dimension measuring the inherent complexity of causal bandits. It allows us to establish the first instance-dependent regret bound and a strictly tighter worst-case regret bound for causal bandits for conditionally benign environments. Additionally, we prove instance-dependent bounds for stochastic linear bandits, which are novel to the best of our knowledge.

  • In Section 5, we study the situation where we have limited knowledge of the marginal distributions over post-action contexts. We provide a lower bound indicating that no algorithm can utilize the causal structure to achieve improved minimax rates without such prior knowledge. This partly justifies the common assumption in the causal bandits literature that algorithms are given the marginals. On the other side, we give a regret upper bound for the phased elimination algorithm with access to approximate marginals. This result shows that partial knowledge of the marginals suffices in some regimes.

Figure 1: The Pareto-optimal frontier of simultaneously achievable rates of regret in (left axis) the class of conditionally benign environments and (bottom axis) the class of all environments. Shaded regions are unobtainable. All rates are determined up to log terms. Among algorithms that achieve minimax rates on conditionally benign environments, the previously best known algorithm (HAC-UCB) is dominated by an instance of Dynamic Balancing, which our results also demonstrate is Pareto optimal.

1.1 Related Work

Causal bandits.

The causal bandit model was introduced by Lattimore et al. (2016), whose objective was to identify the best intervention. This pure exploration problem has been extensively studied since then (Sen et al., 2017; Xiong & Chen, 2022), while other works have focused on regret minimization (Lu et al., 2020; Nair et al., 2021; Bilodeau et al., 2022). Another interesting topic is relaxing the causal assumptions. For example, the assumption of a known causal graph can be relaxed (Lu et al., 2021; Malek et al., 2023). Our work mainly builds on the study by Bilodeau et al. (2022) regarding adapting to the existence of causal structure as well as approximate marginals.

Model selection.

To achieve adaptivity, a natural idea is to apply a model selection algorithm on top of a group of base learners. There is a growing line of work studying such corralling strategies in the bandit setting (Agarwal et al., 2017; Pacchiano et al., 2020a, b; Cutkosky et al., 2020; Arora et al., 2021; Cutkosky et al., 2021). Agarwal et al. (2017) required certain stability conditions on the base learners, making their algorithm quite restricted. In contrast, some recently proposed general-purpose model selection algorithms for stochastic bandit problems (Pacchiano et al., 2020b; Cutkosky et al., 2020, 2021) are better candidates in our setting, since they require only mild assumptions on the base learners.

Pareto optimal frontier.

When there are multiple performance metrics that cannot all be optimized simultaneously, the natural objective becomes characterizing the Pareto optimal frontier. Problems with several competing benchmarks are abundant in the bandit literature (Koolen, 2013; Lattimore, 2015; Marinov & Zimmert, 2021; Zhu & Nowak, 2022).

2 Problem Setup

We consider the problem of stochastic bandits with post-action contexts, as defined by Bilodeau et al. (2022), and follow their notation. Let $\mathcal{A}$ be the finite action space, $\mathcal{Z}$ be the finite context space, and $\mathcal{Y}=[0,1]$ be the reward space. For any set $K$, we use $\mathcal{P}(K)$ to denote the set of all probability distributions supported on $K$. For any $p\in\mathcal{P}(\mathcal{Z}\times\mathcal{Y})$, we use $p(Z)$ to denote its marginal distribution over $\mathcal{Z}$, and $p(Y|Z)$ to denote its conditional distribution over $\mathcal{Y}$ given the $Z$-component.

In this bandit problem, a learner interacts with the stochastic environment for $T$ rounds. The role of the environment is instantiated with a family of distributions $\nu=\{\nu_{a}:a\in\mathcal{A}\}\in\mathcal{P}(\mathcal{Z}\times\mathcal{Y})^{\mathcal{A}}$ indexed by actions in $\mathcal{A}$. For each round $t\in[T]$, the learner picks an action $A_{t}$ from $\mathcal{A}$ and then receives a context-reward pair $(Z_{t},Y_{t})$ which is independently sampled from $\nu_{A_{t}}\in\mathcal{P}(\mathcal{Z}\times\mathcal{Y})$.

To model the learner’s strategy, we need to formalize the information available to the learner when choosing an action. Let $H_{t}=(A_{s},Z_{s},Y_{s})_{s\in[t]}$ denote the observed history up to round $t$, which is a random variable valued in $\mathcal{H}_{t}:=(\mathcal{A}\times\mathcal{Z}\times\mathcal{Y})^{t}$. A policy $\pi$ of the learner can be modeled as a sequence of measurable maps from the $\mathcal{H}_{t}$’s to $\mathcal{A}$,

$$\pi=(\pi_{t})_{t\in[T]}\in\Pi(\mathcal{A},\mathcal{Z},T):=\prod_{t=1}^{T}\{\mathcal{H}_{t-1}\to\mathcal{A}\},$$

where $\Pi(\mathcal{A},\mathcal{Z},T)$ is the space of all policies compatible with $(\mathcal{A},\mathcal{Z},T)$. The learner follows this policy by selecting $A_{t}=\pi_{t}(H_{t-1})$ in each round $t$. The distribution of all outcomes over $T$ rounds, i.e., $(A_{t},Z_{t},Y_{t})_{t\in[T]}$, is jointly determined by the environment $\nu$ and the learner’s policy $\pi$. We will always indicate the ambient joint distribution by a subscript on the probabilistic operators $\mathbb{P}$ and $\mathbb{E}$, e.g., $\mathbb{E}_{\nu_{a}}$ and $\mathbb{E}_{\nu,\pi}$. Additionally, we denote the expected reward for action $a$ and the optimal action $a^{*}$ by

$$\mu^{\mathcal{A}}(a):=\mathbb{E}_{\nu_{a}}[Y],\quad a^{*}:=\operatorname*{arg\,max}_{a\in\mathcal{A}}\mu^{\mathcal{A}}(a). \tag{1}$$

The goal of the learner is to choose some policy $\pi$ that maximizes her expected cumulative reward $\mathbb{E}_{\nu,\pi}[\sum_{t=1}^{T}Y_{t}]$, or equivalently minimizes her expected pseudo-regret

$$\mathbb{E}_{\nu,\pi}[\mathrm{Reg}(T)]:=\mathbb{E}_{\nu,\pi}\Big[\sum_{t=1}^{T}\max_{a\in\mathcal{A}}\mathbb{E}_{\nu_{a}}[Y]-Y_{t}\Big]=T\cdot\mu^{\mathcal{A}}(a^{*})-\mathbb{E}_{\nu,\pi}\Big[\sum_{t=1}^{T}\mu^{\mathcal{A}}(A_{t})\Big],$$

with $\mathrm{Reg}(t):=t\cdot\mu^{\mathcal{A}}(a^{*})-\sum_{s=1}^{t}\mu^{\mathcal{A}}(A_{s})$, $t\in[T]$, being the realized regret, which is stochastic.
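The interaction protocol and the realized regret $\mathrm{Reg}(T)$ can be illustrated with a small simulation. The following is a minimal sketch, not part of the paper: it assumes a tabular environment with Bernoulli rewards, described by hypothetical arrays `q[a, z]` (the marginal $\nu_a(Z)$) and `m[a, z]` (the conditional mean of $Y$ given action $a$ and context $z$), and a hypothetical `policy` callable that maps a history to an action index.

```python
import numpy as np

def run_bandit(q, m, policy, T, seed=0):
    """Simulate T rounds and return the realized regret Reg(T).

    Assumptions (illustrative only, not from the paper):
      q[a, z]  probability of context z under action a, i.e. nu_a(Z)
      m[a, z]  mean of a Bernoulli reward given action a and context z
      policy   callable: list of (a, z, y) tuples -> action index
    """
    rng = np.random.default_rng(seed)
    mu = (q * m).sum(axis=1)                 # mu^A(a) = E_{nu_a}[Y]
    history = []
    for _ in range(T):
        a = policy(history)                  # A_t = pi_t(H_{t-1})
        z = rng.choice(q.shape[1], p=q[a])   # Z_t drawn from nu_a(Z)
        y = int(rng.random() < m[a, z])      # Y_t ~ Bernoulli(m[a, z])
        history.append((a, z, y))
    actions = np.array([a for a, _, _ in history])
    # Reg(T) = T * mu^A(a*) - sum_t mu^A(A_t)
    return T * mu.max() - mu[actions].sum()
```

Note that the realized regret depends on the sampled contexts and rewards only through the actions the policy chooses; a policy that always plays an optimal action has zero regret under this definition.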

Conditionally benign property and $d$-separation.

Under certain structures, the post-action context variable $Z$ enables more efficient exploration and hence smaller regret. One special structure that can be exploited for better regret guarantees in our setting is the conditionally benign property, introduced by Bilodeau et al. (2022).

Definition 2.1.

(Bilodeau et al., 2022, Definition 3.1) An environment $\nu\in\mathcal{P}(\mathcal{Z}\times\mathcal{Y})^{\mathcal{A}}$ is conditionally benign if and only if there exists $p\in\mathcal{P}(\mathcal{Z}\times\mathcal{Y})$ such that for each $a\in\mathcal{A}$, $\nu_{a}(Z)\ll p(Z)$ and $\nu_{a}(Y|Z)=p(Y|Z)$ $p$-a.s. We further denote the space of all conditionally benign environments by $\mathcal{P}_{\mathrm{Benign}}(\mathcal{Z}\times\mathcal{Y})^{\mathcal{A}}$.

The conditionally benign property is quite general in the sense that it is equivalent to or weaker than some well-studied causal assumptions (Bilodeau et al., 2022). In particular, the conditionally benign property is equivalent to the context variable $Z$ being a $d$-separator when $\mathcal{A}$ is the set of all interventions. To leverage this benign structure, the causal UCB (C-UCB) algorithm proposed by Lu et al. (2020) achieves $\tilde{O}(\sqrt{|\mathcal{Z}|T})$ regret, while non-causal algorithms that are unaware of this structure would still incur the possibly worse regret of $\tilde{O}(\sqrt{|\mathcal{A}|T})$.
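To make Definition 2.1 concrete, the sketch below checks the conditionally benign property in a simplified tabular setting. It assumes Bernoulli rewards, so that the conditional distribution of $Y$ given $Z=z$ is determined by its mean; the arrays `q[a, z]` and `m[a, z]` and both function names are illustrative choices, not notation from the paper. Since $\nu_a(Y|Z=z)$ is only determined where $\nu_a(Z=z)>0$, the check compares conditional means only across actions that can actually produce context $z$.

```python
import numpy as np

def is_conditionally_benign(q, m, tol=1e-12):
    """Check the benign property for a tabular Bernoulli-reward environment.

    Assumptions (illustrative only):
      q[a, z]  probability of context z under action a, i.e. nu_a(Z)
      m[a, z]  mean of the Bernoulli reward given action a and context z
    The environment is benign iff, for every context z, all actions that
    reach z with positive probability share the same conditional mean.
    """
    for z in range(q.shape[1]):
        reachable = q[:, z] > 0
        if reachable.any():
            vals = m[reachable, z]
            if vals.max() - vals.min() > tol:
                return False
    return True

def mean_rewards(q, m_shared):
    """Under the benign property, mu^A(a) = sum_z nu_a(z) * E[Y | Z = z]."""
    return q @ m_shared
```

The second function shows why the structure helps: once the shared conditional means (one number per context) and the marginals are known, the expected reward of every action follows, so the number of unknowns scales with $|\mathcal{Z}|$ rather than $|\mathcal{A}|$.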

2.1 Adaptivity

A natural question is whether we can compete with C-UCB when the environment is conditionally benign while at the same time maintaining the worst-case $\tilde{O}(\sqrt{|\mathcal{A}|T})$ regret guarantee, without prior knowledge of the nature of the environment. Unfortunately, algorithms designed specifically for the benign setting may fail drastically in non-benign settings. For instance, C-UCB provably incurs linear regret in some non-benign environments (Bilodeau et al., 2022). To remedy this, Bilodeau et al. (2022) devised HAC-UCB by adding a hypothesis test in each round, which is used to switch irreversibly from C-UCB to UCB whenever it detects a deviation from the conditionally benign property. HAC-UCB is able to recover the $\tilde{O}(\sqrt{|\mathcal{Z}|T})$ regret in benign settings and achieve sublinear $\tilde{O}(T^{3/4})$ regret in the worst case.

Prior to this work, it was not known whether HAC-UCB is optimal. Indeed, Bilodeau et al. (2022) showed that strict adaptation, meaning always achieving the worst-case $O(\sqrt{|\mathcal{A}|T})$ regret while still performing as well as C-UCB when causal structure exists, is impossible. But this does not rule out the possibility of improving the worst-case $\tilde{O}(T^{3/4})$ regret of HAC-UCB unilaterally. In this paper we will show that such an improvement is indeed feasible and thus obtain an algorithm that dominates HAC-UCB. Further, we will show that our regret guarantee is not improvable through the lens of Pareto optimality.

Remark 2.2.

Regarding the optimal rate of regret in the presence of causal structure, it is easy to show an $\Omega(\sqrt{|\mathcal{Z}|T})$ regret lower bound, nearly matching existing $\tilde{O}(\sqrt{|\mathcal{Z}|T})$ regret upper bounds. Whether the log factors can be shaved from the upper bound is unknown. However, the lower bound of Bilodeau et al. (2022) still implies that strict adaptation is impossible for general $\mathcal{A}$ and $\mathcal{Z}$, since when $|\mathcal{A}|/|\mathcal{Z}|$ is, say, $\Omega(T^{1/5})$, the $\Omega(\sqrt{|\mathcal{A}|T})$ lower bound in benign settings rules out a $\tilde{O}(\sqrt{|\mathcal{Z}|T})$ upper bound.

Generic algorithms.

For a rigorous treatment of adaptivity, we adopt the definition of algorithms as maps from Bilodeau et al. (2022). Specifically, an algorithm $\mathfrak{a}$ is any map from problem-specific inputs to the space of compatible policies

$$\mathfrak{a}:(\mathcal{A},\mathcal{Z},T,q)\mapsto\mathfrak{a}(\mathcal{A},\mathcal{Z},T,q)\in\Pi(\mathcal{A},\mathcal{Z},T),$$

where $q\in\mathcal{P}(\mathcal{Z})^{\mathcal{A}}$ is the family of marginal distributions given to the algorithm as prior knowledge. When talking about algorithm-induced policies, by default we mean $\mathfrak{a}(\mathcal{A},\mathcal{Z},T,\nu(Z))$ unless stated otherwise, following the common assumption in the causal bandits literature. We will also deal with the case of imperfect prior knowledge in Section 5, where $q$ may not be the exact $\nu(Z)$. For notational simplicity, we will use $\mathfrak{a}$ to denote its induced policy $\mathfrak{a}(\mathcal{A},\mathcal{Z},T,q)$ when the problem-specific inputs are clear from context. For example, $\mathbb{E}_{\nu,\mathfrak{a}}$ is the same as $\mathbb{E}_{\nu,\mathfrak{a}(\mathcal{A},\mathcal{Z},T,q)}$.

3 The Pareto Regret Frontier

To formalize our notion of Pareto regret frontier, we need the following definition:

Definition 3.1.

A pair of rate functions $(R_{1}(T;\mathcal{A},\mathcal{Z}),R_{2}(T;\mathcal{A},\mathcal{Z}))$ is said to be realizable if there is an algorithm $\mathfrak{a}$ such that for all $\mathcal{A},\mathcal{Z}$ and $T$,

$$\begin{aligned}
\sup_{\nu\in\mathcal{P}_{\mathrm{Benign}}(\mathcal{Z}\times\mathcal{Y})^{\mathcal{A}}}\mathbb{E}_{\nu,\mathfrak{a}}[\mathrm{Reg}(T)]&\leq R_{1}(T;\mathcal{A},\mathcal{Z}),\\
\sup_{\nu\in\mathcal{P}(\mathcal{Z}\times\mathcal{Y})^{\mathcal{A}}}\mathbb{E}_{\nu,\mathfrak{a}}[\mathrm{Reg}(T)]&\leq R_{2}(T;\mathcal{A},\mathcal{Z}).
\end{aligned}$$

A pair $(R_{1}(T;\mathcal{A},\mathcal{Z}),R_{2}(T;\mathcal{A},\mathcal{Z}))$ is reasonable if $R_{1}(T;\mathcal{A},\mathcal{Z})\geq\sqrt{|\mathcal{Z}|T}$ and $R_{2}(T;\mathcal{A},\mathcal{Z})\geq\sqrt{|\mathcal{A}|T}$.

In the following, we elide the dependence of the rates $R_{i}$ on $\mathcal{A}$ and $\mathcal{Z}$ for clarity. We can now describe the Pareto regret frontier, i.e., the set of optimal realizable pairs of rates.

Theorem 3.2.

There exist universal constants $C,c,c^{\prime}>0$ such that

  1. Upper bound: If $(R_{1}(T),R_{2}(T))$ is reasonable and $R_{1}(T)R_{2}(T)\geq|\mathcal{A}|T$, then $(CR_{1}(T)\log T,CR_{2}(T)\log T)$ is realizable;

  2. Lower bound: For all realizable $(R_{1}(T),R_{2}(T))$, we have $R_{2}(T)>c^{\prime}T$ or $R_{1}(T)R_{2}(T)\geq c|\mathcal{A}|T$.
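The two conditions in the upper bound of Theorem 3.2 are easy to evaluate numerically for a candidate pair of rates. The sketch below is illustrative only: the function names are hypothetical, the universal constants and the $\log T$ factors from the theorem are ignored, and `nA`, `nZ` stand for $|\mathcal{A}|$ and $|\mathcal{Z}|$.

```python
import math

def is_reasonable(R1, R2, nA, nZ, T):
    """Definition 3.1: R1 >= sqrt(|Z| T) and R2 >= sqrt(|A| T)."""
    return R1 >= math.sqrt(nZ * T) and R2 >= math.sqrt(nA * T)

def upper_bound_applies(R1, R2, nA, nZ, T):
    """Premise of the upper bound: reasonable and R1 * R2 >= |A| * T.

    Constants and log factors from the theorem are deliberately ignored.
    """
    return is_reasonable(R1, R2, nA, nZ, T) and R1 * R2 >= nA * T

def frontier_R2(R1, nA, T):
    """Smallest R2 allowed by the premise for a given benign-case rate R1:
    the product constraint R2 >= |A| T / R1, floored at sqrt(|A| T)."""
    return max(math.sqrt(nA * T), nA * T / R1)
```

For instance, fixing the benign-case rate at its minimax value $R_1=\sqrt{|\mathcal{Z}|T}$ gives a product constraint of $R_2\geq|\mathcal{A}|\sqrt{T/|\mathcal{Z}|}$, which exceeds the floor $\sqrt{|\mathcal{A}|T}$ exactly when $|\mathcal{A}|\geq|\mathcal{Z}|$.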

Both upper and lower bounds will be extensively discussed in the following sections.
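As a small illustration of the conditions in Theorem 3.2, the following sketch checks whether a pair of rates is reasonable and whether it satisfies the product condition of the upper bound. The function names and the problem sizes are hypothetical choices made only for this example, and all constants and log factors are ignored.

```python
import math

def is_reasonable(r1, r2, num_actions, num_contexts, horizon):
    """Check R1 >= sqrt(|Z| T) and R2 >= sqrt(|A| T)."""
    return (r1 >= math.sqrt(num_contexts * horizon)
            and r2 >= math.sqrt(num_actions * horizon))

def meets_upper_bound_condition(r1, r2, num_actions, num_contexts, horizon):
    """Reasonable pair whose product is at least |A| T."""
    return (is_reasonable(r1, r2, num_actions, num_contexts, horizon)
            and r1 * r2 >= num_actions * horizon)

# Illustrative sizes (assumptions, not taken from the paper).
A, Z, T = 100, 4, 10_000
r1 = math.sqrt(Z * T)   # smallest reasonable benign rate
r2 = A * T / r1         # smallest worst-case rate with R1 * R2 = |A| T

print(meets_upper_bound_condition(r1, r2, A, Z, T))
print(meets_upper_bound_condition(r1, math.sqrt(A * T), A, Z, T))
```

The second call pairs the best benign rate with the best worst-case rate; the product condition fails for these sizes, which is the trade-off that the frontier describes.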

3.1 Upper Bounds

In this section, we show that our upper bound can be obtained by applying the algorithmic principle of dynamic balancing ($\mathrm{DB}$) of Cutkosky et al. (2021) to the stochastic bandit problem with post-action contexts. This method is motivated by the fact that, under mild assumptions, it achieves $\tilde{O}(\sqrt{T})$ regret when run on top of a collection of $\tilde{O}(\sqrt{T})$-regret base learners. As a result, the $\tilde{O}(T^{3/4})$ dependence on $T$ in the regret of $\operatorname{HAC-UCB}$ in Bilodeau et al. (2022) is improved. Dynamic balancing applies in our bandit setting because it does not rely on the kind of (stochastic) contextual information that is observed in the underlying bandit problem. See Appendix A for a detailed explanation.

Algorithm 1 Dynamic balancing ($\mathrm{DB}$) w/ two base learners

Input: Two base learners $\{\mathfrak{a}_i\}_{i=1,2}$, factor $d_i(\cdot)$ of candidate regret bound, reward bias $b_i(\cdot)$ and scaling coefficient $v_i$ (hyper-parameters) for each base learner $i \in \{1,2\}$, and confidence level $\delta \in (0,1)$.

  1. Set $U_i(0) = n_i(0) = 0$ for all $i \in \{1,2\}$ and let the set of active learners be $\mathcal{I}_1 = \{1,2\}$

  2. For $t = 1, 2, \ldots, T$ do

    (a) Select learner from the active set: $i_t \in \operatorname*{arg\,min}_{i \in \mathcal{I}_t} v_i d_i(\delta) \sqrt{n_i(t-1)}$

    (b) Play action $A_t$ of learner $\mathfrak{a}_{i_t}$ and receive reward $Y_t$ and context $Z_t$

    (c) Update learner $\mathfrak{a}_{i_t}$ with $Z_t$ and $Y_t$

    (d) Update $n_i(\cdot)$ and $U_i(\cdot)$:
      $U_i(t) \leftarrow U_i(t-1) + Y_t \mathbb{I}\{i = i_t\}$
      $n_i(t) \leftarrow n_i(t-1) + \mathbb{I}\{i = i_t\}$

    (e) Compute adjusted average reward $\eta_i(t)$ and confidence band $\gamma_i(t)$ for all $i \in \{1,2\}$:
      $\eta_i(t) \leftarrow \frac{U_i(t)}{n_i(t)} - b_i(t)$
      $\gamma_i(t) \leftarrow 3\sqrt{\frac{\log(2 \log n_i(t)/\delta)}{n_i(t)}}$

    (f) Update the set of active learners:
      $\mathcal{I}_{t+1} \leftarrow \Bigl\{ i \in \{1,2\} : \eta_i(t) + \gamma_i(t) + \frac{d_i(\delta)}{\sqrt{n_i(t)}} \geq \max_{j=1,2} \eta_j(t) + \gamma_j(t) \Bigr\}$

Note that the dynamic balancing algorithm (Algorithm 1) takes as input a user-specified candidate regret bound for each base learner $i$ (which takes the form $d_i \sqrt{t}$ in our setting). In each round, $\mathrm{DB}$ picks the active base learner with the smallest candidate regret bound, and then performs a test to identify and deactivate the learners that appear to violate their candidate regret bounds. As long as there is one base learner whose candidate regret bound is valid, $\mathrm{DB}$ is able to compete with the best of such base learners. A more comprehensive exposition of the idea behind dynamic balancing can be found in Cutkosky et al. (2021).
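The loop of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch that follows the steps as printed above, under several assumptions that are not part of the paper: base learners expose hypothetical `act()` and `update(action, context, reward)` methods, the environment is a callable returning a context and a reward, a learner that has not yet been played is kept active, and the inner logarithm of the confidence band is clamped at 1 so that the band is defined for very small counts.

```python
import math
import random

class FixedArmLearner:
    """Hypothetical stand-in for a base learner: always plays one arm."""
    def __init__(self, arm):
        self.arm = arm
    def act(self):
        return self.arm
    def update(self, action, context, reward):
        pass  # a real learner (e.g. UCB) would update its statistics here

def dynamic_balancing(learners, d, v, b, delta, env, horizon):
    """Run the selection/test loop for `horizon` rounds.

    d[i], v[i]: constants d_i(delta) and v_i; b[i]: bias function b_i(t);
    env(action) returns (context, reward).
    """
    k = len(learners)
    U = [0.0] * k               # cumulative reward collected by each learner
    n = [0] * k                 # number of rounds each learner was played
    active = list(range(k))
    for t in range(1, horizon + 1):
        # (a) pick the active learner with the smallest scaled candidate bound
        i_t = min(active, key=lambda i: v[i] * d[i] * math.sqrt(n[i]))
        # (b) play its action and observe context and reward
        action = learners[i_t].act()
        context, reward = env(action)
        # (c) only the selected learner is updated
        learners[i_t].update(action, context, reward)
        # (d) bookkeeping
        U[i_t] += reward
        n[i_t] += 1
        # (e) adjusted average reward and confidence band (played learners only)
        played = [i for i in range(k) if n[i] > 0]
        eta = {i: U[i] / n[i] - b[i](t) for i in played}
        gamma = {i: 3 * math.sqrt(
            math.log(2 * max(math.log(n[i]), 1.0) / delta) / n[i])  # clamp: assumption
            for i in played}
        # (f) keep learners that pass the test; unplayed learners stay active
        best = max(eta[j] + gamma[j] for j in played)
        active = [i for i in range(k)
                  if n[i] == 0
                  or eta[i] + gamma[i] + d[i] / math.sqrt(n[i]) >= best]
    return n, U, active

rng = random.Random(0)

def env(action):
    # Hypothetical Bernoulli environment with no informative context.
    mean = 0.8 if action == 1 else 0.2
    return None, (1.0 if rng.random() < mean else 0.0)

counts, totals, active = dynamic_balancing(
    [FixedArmLearner(0), FixedArmLearner(1)],
    d=[2.0, 2.0], v=[1.0, 1.0], b=[lambda t: 0.0, lambda t: 0.0],
    delta=0.1, env=env, horizon=200)
print(counts, active)
```

The active set is recomputed from scratch in every round, so a learner that fails the test in one round can pass it again later.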

Naturally, we need one base learner that performs well in benign instances and another base learner that remains robust in non-benign instances. For example, we can pick $\operatorname{C-UCB}$ and $\operatorname{UCB}$, but any other algorithms with similar regret bounds can be used as well. Formally, we characterize base learners that enjoy a given regret bound in a given type of environment by the following definition:

Definition 3.3.

Let $d : (0,1) \to \mathbb{R}_{>0}$. A family of learners $\mathfrak{a} = (\mathfrak{a}_\delta)_{\delta \in (0,1)}$ is a $d$-benign family if, for all $\delta \in (0,1)$, for all benign instances, with probability at least $1 - O(\delta)$, for all $t \in [T]$, $\mathfrak{a}_\delta$ has regret no larger than $d(\delta)\sqrt{t}$. Similarly, a learner $\mathfrak{a}$ is a $d$-arbitrary learner if, for all $\delta \in (0,1)$, for all instances, with probability at least $1 - O(\delta)$, for all $t \in [T]$, $\mathfrak{a}$ has regret no larger than $d(\delta)\sqrt{t}$.

Let $\operatorname{C-UCB} = (\operatorname{C-UCB}(\delta))_{\delta \in (0,1)}$ and $\operatorname{UCB} = (\operatorname{UCB}(\delta))_{\delta \in (0,1)}$ be the families of instances of the $\operatorname{C-UCB}$ and $\operatorname{UCB}$ algorithms, respectively, whose confidence band is scaled by $\Theta(\sqrt{\log(1/\delta)})$. See Appendix C for details.

Proposition 3.4.

$\operatorname{C-UCB}$ is a $d$-benign family for $d(\delta) = O\bigl((\sqrt{|\mathcal{Z}|} + \sqrt{\log(T/\delta)}) \sqrt{\log(|\mathcal{Z}|T/\delta)}\bigr)$ and $\operatorname{UCB}$ is a $d'$-arbitrary family for $d'(\delta) = O\bigl(\sqrt{|\mathcal{A}| \log(|\mathcal{A}|T/\delta)}\bigr)$.
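To get a feel for the two factors in Proposition 3.4, the sketch below evaluates them numerically. The hidden constants in the $O(\cdot)$ bounds are not specified here, so the `const` parameter (set to 1) and the problem sizes are assumptions made purely for illustration.

```python
import math

def d_benign_cucb(num_contexts, horizon, delta, const=1.0):
    """Factor of the C-UCB candidate bound, up to an unspecified constant."""
    return const * (
        (math.sqrt(num_contexts) + math.sqrt(math.log(horizon / delta)))
        * math.sqrt(math.log(num_contexts * horizon / delta))
    )

def d_arbitrary_ucb(num_actions, horizon, delta, const=1.0):
    """Factor of the UCB candidate bound, up to an unspecified constant."""
    return const * math.sqrt(num_actions * math.log(num_actions * horizon / delta))

# Illustrative sizes (assumptions): many actions, few contexts.
A, Z, T = 100, 4, 10_000
delta = 1.0 / T
d1 = d_benign_cucb(Z, T, delta)
d2 = d_arbitrary_ucb(A, T, delta)
print(round(d1, 2), round(d2, 2))
```

With these sizes the benign factor is the smaller of the two, which is the regime in which adapting to benign structure is worthwhile.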

Note that the above result for $\operatorname{UCB}$ is folklore, but the result for $\operatorname{C-UCB}$ is new. The following result describes the adaptive regret of dynamic balancing acting on a benign family and an arbitrary family, which validates the upper bound in Theorem 3.2. Notably, to realize every point on the Pareto regret frontier (up to log factors), we need only tune the hyper-parameters of $\mathrm{DB}$ accordingly. We elide the dependence of the rates $R_i$ on $\mathcal{A}$ and $\mathcal{Z}$ below for clarity. See Section A.2 for the proof.

Theorem 3.5.

Let $\mathfrak{a}_1$ be a $d_1$-benign family and let $\mathfrak{a}_2$ be a $d_2$-arbitrary family of learners, where $d_1(\delta) = O\bigl((\sqrt{|\mathcal{Z}|} + \sqrt{\log(T/\delta)}) \sqrt{\log(|\mathcal{Z}|T/\delta)}\bigr)$ and $d_2(\delta) = O\bigl(\sqrt{|\mathcal{A}| \log(|\mathcal{A}|T/\delta)}\bigr)$. For every pair of reasonable rate functions $R_1(T), R_2(T)$ such that $R_1(T) R_2(T) \geq |\mathcal{A}|T$, there exist hyper-parameters $b_i(\cdot), v_i$, $i = 1, 2$, such that, for all instances $\nu$, the policy $\mathrm{DB}(\delta)$, for $\delta = 1/T$, given by Algorithm 1 with $\mathfrak{a}_1, \mathfrak{a}_2$ and $d_1, d_2$, satisfies

$$\mathbb{E}_{\nu,\mathrm{DB}(\delta)}[\mathrm{Reg}(T)] = \tilde{O}(R_1(T)) \quad \text{if } \nu \text{ is conditionally benign,}$$
$$\mathbb{E}_{\nu,\mathrm{DB}(\delta)}[\mathrm{Reg}(T)] = \tilde{O}(R_2(T)) \quad \text{if } \nu \text{ is arbitrary.}$$

Corollary 3.6.

Taking $R_1(T) = \sqrt{|\mathcal{Z}|T}$ and $R_2(T) = \sqrt{|\mathcal{A}|/|\mathcal{Z}|} \cdot \sqrt{|\mathcal{A}|T}$, the conclusion of Theorem 3.5 is

$$\mathbb{E}_{\nu,\mathrm{DB}(\delta)}[\mathrm{Reg}(T)] = \tilde{O}\bigl(\sqrt{|\mathcal{Z}|T}\bigr) \quad \text{if } \nu \text{ is conditionally benign,}$$
$$\mathbb{E}_{\nu,\mathrm{DB}(\delta)}[\mathrm{Reg}(T)] = \tilde{O}\bigl(\sqrt{|\mathcal{A}|/|\mathcal{Z}|} \cdot \sqrt{|\mathcal{A}|T}\bigr) \quad \text{if } \nu \text{ is arbitrary.}$$

Corollary 3.6 indicates that we need to pay an extra factor of $\sqrt{|\mathcal{A}|/|\mathcal{Z}|}$ in the worst-case regret for adaptivity, which already improves on the worst-case regret of $\operatorname{HAC-UCB}$. Moreover, our regret analysis does not require their cumbersome assumption that $T \geq 25|\mathcal{A}|^2$. This improvement may be explained as follows. Both dynamic balancing and $\operatorname{HAC-UCB}$ work with two base learners and decide which one to play in each round. However, $\mathrm{DB}$ operates in a more reasonable way: $\mathrm{DB}$ alternates between the two base learners and never deactivates either of them permanently, whereas $\operatorname{HAC-UCB}$ first plays the optimistic base learner persistently up to some point and then switches to $\operatorname{UCB}$ for the remaining rounds. Thus the regret that $\operatorname{HAC-UCB}$ incurs by running the optimistic base learner improperly may be dominant.
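As a quick check on the choice of rates in Corollary 3.6, the short computation below verifies that the pair sits exactly on the boundary of the product condition in Theorem 3.5, and records when the pair is reasonable. It only rearranges the expressions stated in the corollary.

```latex
\begin{align*}
R_1(T)\,R_2(T)
  &= \sqrt{|\mathcal{Z}|T}\cdot\sqrt{\frac{|\mathcal{A}|}{|\mathcal{Z}|}}\cdot\sqrt{|\mathcal{A}|T}
   = \sqrt{|\mathcal{A}|T}\cdot\sqrt{|\mathcal{A}|T}
   = |\mathcal{A}|T, \\
R_2(T) \ge \sqrt{|\mathcal{A}|T}
  &\iff \sqrt{\frac{|\mathcal{A}|}{|\mathcal{Z}|}} \ge 1
   \iff |\mathcal{A}| \ge |\mathcal{Z}|.
\end{align*}
```

So the pair is reasonable precisely when $|\mathcal{A}| \geq |\mathcal{Z}|$, which is also the regime in which the benign rate $\sqrt{|\mathcal{Z}|T}$ is no worse than the worst-case rate $\sqrt{|\mathcal{A}|T}$.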

3.2 Lower Bounds

In this section we elaborate on the lower bound in Theorem 3.2 in the following Theorem 3.7, which is a generalization of (Bilodeau et al., 2022, Theorem 6.2). The proof of Theorem 3.7 closely follows that of the original, but we are able to derive a continuum of lower bounds that constitute the Pareto regret frontier. For completeness, we provide the full proof in Section D.1.

Theorem 3.7.

There exist constants $c, c' > 0$ such that, for all MAB algorithms $\mathfrak{a}$ and rate functions $R(T;\mathcal{A},\mathcal{Z})$, if, for all $\mathcal{A}, \mathcal{Z}, T$,

$$\sup_{\nu} \mathbb{E}_{\nu,\mathfrak{a}}[\mathrm{Reg}(T)] \leq R(T;\mathcal{A},\mathcal{Z}),$$

then, for all $\mathcal{A}, \mathcal{Z}$ and $T$, either $R(T;\mathcal{A},\mathcal{Z}) > c'T$, or there exists a conditionally benign environment $\nu$ such that

$$\mathbb{E}_{\nu,\mathfrak{a}}[\mathrm{Reg}(T)] \geq c \cdot \frac{|\mathcal{A}|T}{R(T;\mathcal{A},\mathcal{Z})}.$$

Theorem 3.7 shows that any pair of realizable rates must have their product lower bounded by $|\mathcal{A}|T$ (up to a constant factor) unless the worst-case regret bound is vacuously large. Combining Theorem 3.5 with Theorem 3.7, we have justified the Pareto optimality of dynamic balancing. As a corollary, we have found a problem of adaptation where a model selection method can be optimal, and the price of adaptivity is witnessed by the additional multiplicative factor of $\sqrt{|\mathcal{A}|/|\mathcal{Z}|}$ in the regret bound.
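The step from Theorem 3.7 to the product form of the lower bound in Theorem 3.2 can be spelled out as follows. This sketch assumes that a pair $(R_1(T), R_2(T))$ being realizable means that a single algorithm $\mathfrak{a}$ has expected regret at most $R_1(T)$ on every conditionally benign instance and at most $R_2(T)$ on every instance; the formal definition is given earlier in the paper. Applying Theorem 3.7 with $R = R_2$ and the benign environment $\nu$ it provides:

```latex
\[
R_2(T) \le c'T
\;\Longrightarrow\;
R_1(T) \;\ge\; \mathbb{E}_{\nu,\mathfrak{a}}[\mathrm{Reg}(T)]
       \;\ge\; c\cdot\frac{|\mathcal{A}|T}{R_2(T)}
\;\Longrightarrow\;
R_1(T)\,R_2(T) \;\ge\; c\,|\mathcal{A}|T .
\]
```

Hence either $R_2(T) > c'T$ or $R_1(T) R_2(T) \geq c|\mathcal{A}|T$, which is the second item of Theorem 3.2.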

4 Instance-Dependent Bounds via Phased Elimination Algorithm

Besides achieving the Pareto optimal regret bounds in Theorem 3.5, which are worst-case in nature, the dynamic balancing algorithm can simultaneously enjoy $O(\log T)$ instance-dependent regret under additional assumptions on the base learners. In particular, $\operatorname{C-UCB}$ may not be the best choice for the benign base learner. To leverage the strength of dynamic balancing, in this section we propose a new causal bandit algorithm that enjoys $\tilde{O}(\sqrt{|\mathcal{Z}|T})$ worst-case regret and a novel logarithmic instance-dependent regret in benign settings. We are the first to pursue instance-dependent results in conditionally benign environments for algorithms that are minimax optimal (up to log factors).

Our new algorithm is built upon the idea of phased elimination with G-optimal design from linear bandits (Lattimore & Szepesvári, 2020; Lattimore et al., 2020). Our regret analysis hinges on a novel reduction from causal bandits to linear bandits. This reduction enables the use of a broad family of linear bandit algorithms in conditionally benign environments, whose regret guarantees remain intact.

Finally, we will discuss the possibilities and challenges regarding adaptive $O(\log T)$ instance-dependent regret.

4.1 Reduction to Linear Bandits

We need additional notation to illustrate our causal-to-linear reduction. For a benign instance $\nu$, define the mean reward vector $\mu^{\mathcal{Z}} \in [0,1]^{|\mathcal{Z}|}$ by $\mu^{\mathcal{Z}}(z) = \mathbb{E}_{\nu_a}[Y \mid Z = z]$ for all $z \in \mathcal{Z}$. Also, in this section we use $\nu_a$ to denote its associated marginal distribution vector $\nu_a(Z) \in \mathcal{P}(\mathcal{Z}) \subset \mathbb{R}^{|\mathcal{Z}|}$, and we do not distinguish between an action $a$ and its associated marginal vector $\nu_a$.

Recall that in each round $t$ we play some action $A_t$ and then observe context $Z_t$ and reward $Y_t$. By simply ignoring the realized contexts $Z_t$, we can write $Y_t = \sum_{z \in \mathcal{Z}} \mu^{\mathcal{Z}}(z) \cdot \nu_{A_t}(z) + \eta^{\mathcal{A}}_t = \langle \mu^{\mathcal{Z}}, \nu_{A_t} \rangle + \eta^{\mathcal{A}}_t$, where $\eta^{\mathcal{A}}_t$ is conditionally 1-sub-Gaussian since $\mathbb{E}[\eta^{\mathcal{A}}_t \mid (A_s, Y_s)_{s \leq t-1}, A_t] = 0$ and $\eta^{\mathcal{A}}_t \in [-1, 1]$. So now we may think of the game as a linear bandit with actions $\nu_a$ and unknown mean reward vector $\mu^{\mathcal{Z}}$. Therefore, any linear bandit algorithm that allows such a conditionally sub-Gaussian noise condition should be able to operate in our benign setting by ignoring the realized contexts. More importantly, its regret analysis goes through without change, and hence its regret bounds are retained without loss.
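The reduction can be illustrated with a small simulation: in a benign instance the mean reward of an action is the inner product of the mean reward vector with the marginal of that action, so a learner that ignores the realized contexts faces a linear reward model. The numbers below (mean rewards, marginals, Bernoulli rewards) are hypothetical and serve only as an example.

```python
import random

def expected_reward(mu_z, nu_a):
    """Inner product <mu^Z, nu_a>: mean reward of an action in a benign instance."""
    return sum(m * p for m, p in zip(mu_z, nu_a))

def sample_round(mu_z, nu_a, rng):
    """Draw Z ~ nu_a, then a Bernoulli reward whose mean is mu^Z(Z)."""
    z = rng.choices(range(len(nu_a)), weights=nu_a)[0]
    y = 1.0 if rng.random() < mu_z[z] else 0.0
    return z, y

mu_z = [0.2, 0.9, 0.5]                      # hypothetical mean reward per context
marginals = {"a1": [0.7, 0.2, 0.1],         # hypothetical marginals nu_a
             "a2": [0.1, 0.8, 0.1]}

rng = random.Random(0)
num_samples = 20_000
empirical = {
    a: sum(sample_round(mu_z, nu, rng)[1] for _ in range(num_samples)) / num_samples
    for a, nu in marginals.items()
}
for a, nu in marginals.items():
    print(a, expected_reward(mu_z, nu), empirical[a])
```

The empirical averages use only the rewards, never the sampled contexts, and they concentrate around the linear predictions.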

4.2 Phased Elimination and its Regret Bound

Among all valid linear bandit algorithms that can be applied in conditionally benign environments, we opt for the phased elimination algorithm ($\operatorname{PE}$) due to its superior performance when the action set is finite. Its pseudo-code is summarized in Algorithm 2, which is essentially the same as in Lattimore et al. (2020). However, the regret guarantees we present for $\operatorname{PE}$ are novel. Our first result is an anytime worst-case regret bound, which qualifies $\operatorname{PE}$ as a base learner for dynamic balancing. Again, $\operatorname{PE} = (\operatorname{PE}(\delta))_{\delta \in (0,1)}$ is the family of instances of the phased elimination algorithm, indexed by the confidence level $\delta$.

Theorem 4.1 (Worst-case regret bound for $\operatorname{PE}$).

For all $\delta \in (0,1)$, the policy $\operatorname{PE}(\delta)$ given by Algorithm 2 satisfies the following regret bound for all conditionally benign environments $\nu$,

$$\mathrm{Reg}(t) \leq C \sqrt{d_\nu \log\left(\frac{|\mathcal{A}| \log T}{\delta}\right) t}, \quad \forall t \in [T]$$

with probability at least $1 - \delta$, where $d_\nu = \dim(\mathrm{span}\{\nu_a : a \in \mathcal{A}\})$ and $C > 0$ is a universal constant. Note that $d_\nu \leq |\mathcal{Z}|$ and it could be $|\mathcal{Z}|$ in the worst case. In particular, after taking $\delta = 1/T$, we obtain the expected regret bound

$$\mathbb{E}_{\nu,\operatorname{PE}(\delta)}[\mathrm{Reg}(T)] = O\left(\sqrt{d_\nu T \log(|\mathcal{A}|T)}\right).$$

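The passage from the high-probability bound to the expected bound in Theorem 4.1 is a standard step, sketched here under the assumption (not restated in this section) that the regret never exceeds $T$, for instance because per-round regret is at most 1. Writing $B$ for the high-probability bound at $t = T$:

```latex
\begin{align*}
\mathbb{E}_{\nu,\operatorname{PE}(\delta)}[\mathrm{Reg}(T)]
  &\le (1-\delta)\,B + \delta\,T
   \;\le\; B + 1
   \qquad (\delta = 1/T), \\
B &= C\sqrt{d_\nu\, T \log\bigl(|\mathcal{A}|\,T\log T\bigr)}
   \;\le\; C\sqrt{2\, d_\nu\, T \log\bigl(|\mathcal{A}|T\bigr)},
\end{align*}
```

where the last inequality uses $\log(|\mathcal{A}| T \log T) \leq \log(|\mathcal{A}| T^2) \leq 2\log(|\mathcal{A}|T)$.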
Corollary 4.2.

$\operatorname{PE}$ is a $d$-benign family for $d(\delta) = O\left(\sqrt{|\mathcal{Z}| \log\left(|\mathcal{A}| \log T/\delta\right)}\right)$. Therefore, the expected regret bound in Theorem 3.5 can also be achieved by $\mathrm{DB}$ with $\operatorname{PE}$ and $\operatorname{UCB}$ as base learners.

See Appendix B for the proof. Thanks to our reduction, the bound in Theorem 4.1 depends only on $d_\nu$ (up to log factors) rather than $|\mathcal{Z}|$. This indicates that the intrinsic complexity of the causal bandit problem is not $|\mathcal{Z}|$ and can be further reduced to $d_\nu$, which is not captured by the $\tilde{O}(\sqrt{|\mathcal{Z}|T})$ regret bound of $\operatorname{C-UCB}$.
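The quantity $d_\nu$ is the rank of the matrix whose rows are the marginal vectors $\nu_a$. A minimal sketch of how it could be computed is given below, using exact rational arithmetic to avoid numerical rank issues; the function name and the example marginals are hypothetical.

```python
from fractions import Fraction

def marginal_span_dimension(marginals):
    """Rank of the matrix whose rows are the marginals nu_a (Gaussian elimination)."""
    rows = [[Fraction(x) for x in row] for row in marginals]
    num_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(num_cols):
        # Find a row at or below the current rank with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        # Eliminate this column from all rows below the pivot row.
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[rank])]
        rank += 1
    return rank

half, quarter = Fraction(1, 2), Fraction(1, 4)
# Hypothetical instance: three actions over four contexts.
# The third marginal is the average of the first two.
marginals = [
    [half, half, 0, 0],
    [0, 0, half, half],
    [quarter, quarter, quarter, quarter],
]
d_nu = marginal_span_dimension(marginals)
print(d_nu)
```

In this example $|\mathcal{Z}| = 4$ while the marginals span only a two-dimensional subspace, so a bound in terms of $d_\nu$ is strictly sharper than one in terms of $|\mathcal{Z}|$.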

Next we give an instance-dependent regret bound for $\operatorname{PE}$. Notice that this bound is new even for stochastic linear bandits (with finite action sets). See Appendix B for the proof.

Theorem 4.3 (Instance-dependent regret bound for $\operatorname{PE}$).

For all $\delta\in(0,1)$, the policy $\operatorname{PE}(\delta)$ given by Algorithm 2 satisfies the following regret bound for all conditionally benign environments $\nu$,

$$\mathrm{Reg}(T)\leq C\cdot\frac{d_{\nu}\log\left(|\mathcal{A}|\log T/\delta\right)}{\Delta_{\min}(\nu)}$$

with probability at least $1-\delta$, where $\Delta_{\min}(\nu):=\min_{a\neq a^{*}}\mu^{\mathcal{A}}(a^{*})-\mu^{\mathcal{A}}(a)$ is the minimal sub-optimality gap of instance $\nu$ and $C>0$ is a universal constant. In particular, taking $\delta=1/T$,

$$\mathbb{E}_{\nu,\operatorname{PE}(\delta)}[\mathrm{Reg}(T)]=O\left(\frac{d_{\nu}\log(|\mathcal{A}|T)}{\Delta_{\min}(\nu)}\right).$$
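
The step from the high-probability bound to the in-expectation bound can be sketched as follows. This is only an illustration under an assumption not stated in the theorem itself, namely that the per-round regret is at most $1$, so that $\mathrm{Reg}(T)\leq T$ always holds. Here $E$ denotes the event on which the high-probability bound holds, so $\Pr(E^{c})\leq\delta$.

```latex
\begin{align*}
\mathbb{E}[\mathrm{Reg}(T)]
  &\le \Pr(E)\cdot C\,\frac{d_{\nu}\log(|\mathcal{A}|\log T/\delta)}{\Delta_{\min}(\nu)}
       + \Pr(E^{c})\cdot T \\
  &\le C\,\frac{d_{\nu}\log(|\mathcal{A}|\,T\log T)}{\Delta_{\min}(\nu)} + 1
       && (\delta = 1/T) \\
  &\le 2C\,\frac{d_{\nu}\log(|\mathcal{A}|\,T)}{\Delta_{\min}(\nu)} + 1
       && (\log T \le |\mathcal{A}|\,T)
\end{align*}
```

The additive constant is absorbed into the $O(\cdot)$ whenever $d_{\nu}\log(|\mathcal{A}|T)/\Delta_{\min}(\nu)$ is bounded below by a constant, for example when $\Delta_{\min}(\nu)\leq 1$.
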

Algorithm 2 Phased Elimination ($\operatorname{PE}$) in Causal Bandit

Input: Action set $\mathcal{A}$, marginals $\{\nu_{a}:a\in\mathcal{A}\}$, $d_{\nu}=\dim(\mathrm{span}\{\nu_{a}:a\in\mathcal{A}\})$, and confidence level $\delta\in(0,1)$

  1. Set $\ell=1$ and let the initial active set $\mathcal{A}_{1}$ be $\mathcal{A}$

  2. Find some near-optimal design $\pi_{\ell}\in\mathcal{P}(\mathcal{A}_{\ell})$ with $\max_{a\in\mathcal{A}_{\ell}}\|\nu_{a}\|^{2}_{V(\pi_{\ell})^{-1}}\leq 2d_{\nu}$ and $|\mathrm{supp}(\pi_{\ell})|\leq 4d_{\nu}\log\log(d_{\nu})+16$, where $V(\pi_{\ell})=\sum_{a\in\mathcal{A}_{\ell}}\pi_{\ell}(a)\nu_{a}\nu_{a}^{\top}$

  3. Let $m_{\ell}=2^{\ell-1}(4d_{\nu}\log\log(d_{\nu})+16)$. Compute $T_{\ell}(a)=\left\lceil m_{\ell}\pi_{\ell}(a)\right\rceil$ and $T_{\ell}=\sum_{a\in\mathcal{A}_{\ell}}T_{\ell}(a)$

  4. Play each action $a\in\mathcal{A}_{\ell}$ exactly $T_{\ell}(a)$ times; we call these $T_{\ell}$ rounds phase $\ell$. We also observe the corresponding context-reward pairs $(Z_{t},Y_{t})_{t\in\text{phase }\ell}$

  5. Compute the empirical estimate $\hat{\mu}^{\mathcal{Z}}_{\ell}=V_{\ell}^{-1}\sum_{t\in\text{phase }\ell}\nu_{A_{t}}Y_{t}$, where $V_{\ell}=\sum_{a\in\mathcal{A}_{\ell}}T_{\ell}(a)\nu_{a}\nu_{a}^{\top}$

  6. Eliminate low rewarding actions and update the active set:
    $\mathcal{A}_{\ell+1}=\Bigl\{a\in\mathcal{A}_{\ell}:\max_{b\in\mathcal{A}_{\ell}}\langle\hat{\mu}^{\mathcal{Z}}_{\ell},\nu_{b}-\nu_{a}\rangle\leq 2\sqrt{\frac{4d_{\nu}}{m_{\ell}}\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}\Bigr\}$

  7. $\ell\leftarrow\ell+1$ and Goto 2
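
The steps of Algorithm 2 can be sketched in Python as below. This is a minimal illustration under several assumptions of ours rather than a reference implementation: the near-optimal design of step 2 is approximated by plain Frank-Wolfe iterations and the support-size condition is not enforced; a pseudo-inverse stands in for $V_{\ell}^{-1}$ when the active marginals do not span $\mathbb{R}^{|\mathcal{Z}|}$; the case $d_{\nu}=1$ uses the constant 16 because $\log\log 1$ is undefined; and the loop simply stops when the next phase no longer fits into the horizon. The names `design`, `phased_elimination`, and the environment callback `pull` are hypothetical.

```python
import numpy as np

def design(X, iters=1000):
    """Frank-Wolfe iterations towards a G-optimal design over the rows of X.

    Stops once max_a ||x_a||^2 (in the V(pi)^+ norm) <= 2 * rank(X).
    The support-size condition of step 2 is not enforced (simplification).
    """
    r = np.linalg.matrix_rank(X)
    pi = np.full(len(X), 1.0 / len(X))
    for _ in range(iters):
        V_pinv = np.linalg.pinv(X.T @ (pi[:, None] * X), rcond=1e-10)
        g = np.einsum("ai,ij,aj->a", X, V_pinv, X)
        a = int(np.argmax(g))
        if g[a] <= 2 * r:
            break
        step = (g[a] / r - 1) / (g[a] - 1)   # standard Frank-Wolfe step size
        pi *= 1 - step
        pi[a] += step
    return pi

def phased_elimination(nu, pull, T, delta):
    """Run complete phases within T rounds and return the last active set.

    nu   : (|A|, |Z|) array whose row a is the marginal nu_a
    pull : callable a -> observed reward (hypothetical environment interface)
    """
    n_actions, n_ctx = nu.shape
    d = np.linalg.matrix_rank(nu)                        # d_nu
    # Constant from step 3; the d == 1 guard is our own addition.
    base = 4 * d * np.log(np.log(d)) + 16 if d >= 2 else 16.0
    log_term = np.log(2 * n_actions * np.log2(T) / delta)
    active = np.arange(n_actions)
    t, ell = 0, 1
    while True:
        X = nu[active]
        pi = design(X)
        m = 2 ** (ell - 1) * base
        counts = np.ceil(m * pi).astype(int)             # T_ell(a)
        if t + counts.sum() > T:                         # next phase does not fit
            return active
        V = X.T @ (counts[:, None] * X)                  # V_ell
        s = np.zeros(n_ctx)
        for a, c in zip(active, counts):
            for _ in range(c):
                s += nu[a] * pull(a)
        t += counts.sum()
        mu_hat = np.linalg.pinv(V, rcond=1e-10) @ s      # least-squares estimate
        est = X @ mu_hat                                 # estimated mean per active arm
        width = 2 * np.sqrt(4 * d / m * log_term)
        active = active[est.max() - est <= width]        # step 6
        ell += 1

# Toy instance with made-up numbers: 4 arms, 3 context values.
nu = np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [1 / 3, 1 / 3, 1 / 3]])
mu_z = np.array([0.0, 0.5, 1.0])
rng = np.random.default_rng(0)
noisy = lambda a: nu[a] @ mu_z + 0.1 * rng.standard_normal()
print(phased_elimination(nu, noisy, T=20000, delta=0.01))
```

Note that the estimate in step 5 only uses the known marginals $\nu_{A_{t}}$ and the rewards $Y_{t}$; the realized contexts $Z_{t}$ never enter the computation, which is what makes the linear-bandit reduction apply.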

Remark 4.4.

During the implementation of Algorithm 2, it is possible that $\{\nu_{a}:a\in\mathcal{A}_{\ell}\}$ does not span $\mathbb{R}^{|\mathcal{Z}|}$ for some $\ell$, so that $V(\pi_{\ell})$ is singular for any $\pi_{\ell}\in\mathcal{P}(\mathcal{A}_{\ell})$. For example, in later phases $|\mathcal{A}_{\ell}|$ can be smaller than $|\mathcal{Z}|$. Suppose $\dim(\mathrm{span}\{\nu_{a}:a\in\mathcal{A}_{\ell}\})=r<|\mathcal{Z}|$.
One workaround is to apply some invertible matrix $X\in\mathbb{R}^{|\mathcal{Z}|\times|\mathcal{Z}|}$ to every $\nu_{a}$, $a\in\mathcal{A}_{\ell}$, such that $X\nu_{a}$ decomposes into an $r$-dimensional vector $(X\nu_{a})_{[r]}$ followed by a tail of $(|\mathcal{Z}|-r)$ zeros, and $\left\{(X\nu_{a})_{[r]}:a\in\mathcal{A}_{\ell}\right\}$ spans $\mathbb{R}^{r}$. We then use $\left\{(X\nu_{a})_{[r]}:a\in\mathcal{A}_{\ell}\right\}$ as our active set in phase $\ell$, and the analysis goes through.

4.3 Roadblocks: Instance-Dependent Bounds

Unlike the adaptive worst-case regret studied in Section 3, adaptive instance-dependent regret is less understood, and a general theory is still absent from the literature. In particular, we do not know whether $O(\log T)$ regret can always be achieved and, when it is achieved, whether it is tight. We illustrate these issues for model selection methods. First, it is easy to see that $O(\log T)$ regret can always be achieved in benign environments, e.g., by corralling $\operatorname{PE}$ and $\operatorname{UCB}$ using dynamic balancing, because in this case both base learners admit logarithmic regret. However, the regret bound of $\operatorname{UCB}$ is dominant, and thus a naive calculation only leads to $O(|\mathcal{A}|\log T/\Delta_{\min})$ regret for $\mathrm{DB}$. It remains open whether we can adapt to the smaller regret achieved by $\operatorname{PE}$ in benign environments. Second, $O(\log T)$ regret is not always guaranteed by model selection in non-benign instances. The only exception we are aware of in the literature is the case where the causal base learner is assumed to incur linear regret whenever its candidate regret bound fails (Cutkosky et al., 2021, Theorem 31). If this type of “algorithmic gap” holds, $\mathrm{DB}$ will only choose the causal base learner on $O(\log T)$ rounds, and hence enjoy logarithmic regret.
Moreover, without changing the parameter setting, $\mathrm{DB}$ is able to realize the Pareto optimal rates $(\sqrt{|\mathcal{Z}|T},|\mathcal{A}|\sqrt{T}/\sqrt{|\mathcal{Z}|})$ up to log factors. However, the “algorithmic gap” requirement on the causal base learner is so stringent that we do not know whether it is met by any algorithm in every instance. In Appendix E, we show that a version of $\operatorname{PE}$ incurs linear regret on some instances.

5 Limited Knowledge of the Marginal Distributions over Context Variables

So far, we have assumed that the algorithm knows the marginal distribution over the post-action context for each arm. Of course, perfect knowledge of these marginals may not be available in practice. What is the effect of only having access to approximate marginals on the achievable rates of regret?

In this section, we study this question. We give a lower bound indicating that, with zero access to the marginals, it is impossible for any algorithm to exploit the causal structure and beat the minimax rate of an arbitrary environment. To model this setting, recall that algorithms are defined as mappings taking $(\mathcal{A},\mathcal{Z},T,q)$ to policies. Naturally, algorithms that are agnostic to the marginals should be constant in $q\in\mathcal{P}(\mathcal{Z})^{\mathcal{A}}$, leading to the following definition:

Definition 5.1.

An algorithm $\mathfrak{a}$ is said to be agnostic to marginals if, for any $\mathcal{A},\mathcal{Z}$ and $T$, the map

$$\mathfrak{a}_{\mathcal{A},\mathcal{Z},T}:q\mapsto\mathfrak{a}(\mathcal{A},\mathcal{Z},T,q)$$

is constant over $\mathcal{P}(\mathcal{Z})^{\mathcal{A}}$. We denote the set of all such algorithms by $\mathbb{A}_{\mathrm{agnostic}}$.

Examples of algorithms from $\mathbb{A}_{\mathrm{agnostic}}$ include not only heuristic non-causal algorithms like $\operatorname{UCB}$, but also versions of causal algorithms that are always run with the same fixed marginals as input. For every algorithm $\mathfrak{a}\in\mathbb{A}_{\mathrm{agnostic}}$, we write the policy it induces given $\mathcal{A},\mathcal{Z},T$ as $\mathfrak{a}(\mathcal{A},\mathcal{Z},T,\cdot)$ to highlight its independence of the $q$ component. Our lower bound shows that, in this zero-marginal-knowledge regime, we cannot do better than the optimal non-causal algorithm.

Theorem 5.2.

For all $\mathcal{A},\mathcal{Z},T\geq|\mathcal{A}|$ and MAB algorithms $\mathfrak{a}\in\mathbb{A}_{\mathrm{agnostic}}$, there exists a conditionally benign environment $\nu\in\mathcal{P}(\mathcal{Z}\times\mathcal{Y})^{\mathcal{A}}$ such that

$$\mathbb{E}_{\nu,\mathfrak{a}}[\mathrm{Reg}(T)]\geq c\sqrt{|\mathcal{A}|T},$$

where $c>0$ is a universal constant.

See Section D.2 for the proof.

Remark 5.3.

Our lower bound improves on Lu et al. (2020, Theorem 4), which is of the form $C_{\varepsilon}\sqrt{|\mathcal{A}|}T^{1/2-\varepsilon}$ for all $\varepsilon>0$ and only holds for a certain set of non-causal algorithms, which is a strict subset of $\mathbb{A}_{\mathrm{agnostic}}$.

5.1 Phased Elimination with Approximate Marginals

Despite the negative result of Theorem 5.2, we now argue that some level of misspecification is allowed in the prior knowledge of the marginals. Upon interacting with environment $\nu$, suppose we are given some marginal $\tilde{\nu}(Z)\in\mathcal{P}(\mathcal{Z})^{\mathcal{A}}$, which may deviate from the true $\nu(Z)$ to some extent. We now show that even instantiating $\operatorname{PE}$ with the possibly inaccurate $\tilde{\nu}(Z)$ may yield $\tilde{O}(\sqrt{T})$ regret, following a similar result for $\operatorname{C-UCB}$ by Bilodeau et al. (2022). First, we need the following definition to measure the deviation of $\tilde{\nu}(Z)$ from $\nu(Z)$.

Definition 5.4.

(Bilodeau et al., 2022, Definition 4.2) For any $\varepsilon\geq 0$, $\tilde{\nu}(Z)$ and $\nu(Z)$ are said to be $\varepsilon$-close if

$$\sup_{a\in\mathcal{A}}\sum_{z\in\mathcal{Z}}\left|\tilde{\nu}_{a}(z)-\nu_{a}(z)\right|\leq\varepsilon.$$
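
For finite $\mathcal{A}$ and $\mathcal{Z}$, the condition above is a maximum over arms of an $\ell_{1}$ distance and can be checked directly. The following is a small illustrative sketch; the function names `marginal_deviation` and `is_eps_close`, and the representation of marginals as one list of probabilities per arm, are our own choices.

```python
def marginal_deviation(nu_tilde, nu):
    """Largest per-arm L1 distance: max_a sum_z |nu_tilde_a(z) - nu_a(z)|."""
    return max(
        sum(abs(p - q) for p, q in zip(row_tilde, row))
        for row_tilde, row in zip(nu_tilde, nu)
    )

def is_eps_close(nu_tilde, nu, eps):
    """True if the two collections of marginals are eps-close."""
    return marginal_deviation(nu_tilde, nu) <= eps

# Two arms, two context values (made-up numbers); only the first arm is misspecified.
nu = [[0.5, 0.5], [0.2, 0.8]]
nu_tilde = [[0.6, 0.4], [0.2, 0.8]]
print(is_eps_close(nu_tilde, nu, 0.25))
```

In this toy example the first arm contributes a deviation of about $0.2$ and the second arm contributes $0$, so the pair is $\varepsilon$-close for any $\varepsilon$ above roughly $0.2$.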

Thanks to our reduction in Section 4.1, causal bandits with misspecified marginals reduce to the well-studied misspecified linear bandits, which yields the following regret bound that subsumes Theorem 4.1. The proof is largely based on the analysis of phased elimination in Lattimore et al. (2020, Proposition 5.1), with the modifications needed to handle conditionally sub-Gaussian noise and to provide an anytime regret bound. See Appendix B for details.

Theorem 5.5 (Worst-case regret bound, with approximate marginal distributions).

In any conditionally benign environment $\nu$, suppose we instantiate $\operatorname{PE}(\delta)$ with $\tilde{\nu}(Z)$. If $\tilde{\nu}(Z)$ and $\nu(Z)$ are $\varepsilon$-close, then with probability at least $1-\delta$, the regret of $\operatorname{PE}(\delta)$ is bounded for all rounds $t\in[T]$ by

$$\mathrm{Reg}(t)\leq C\left(\sqrt{d_{\tilde{\nu}}\log\left(\frac{|\mathcal{A}|\log T}{\delta}\right)t}+\varepsilon t\sqrt{d_{\tilde{\nu}}}\log T\right),$$

where $C>0$ is a universal constant and $d_{\tilde{\nu}}=\dim(\mathrm{span}\{\tilde{\nu}_{a}:a\in\mathcal{A}\})$.

This implies that $\varepsilon=\tilde{O}(\sqrt{1/T})$ suffices to recover all of the aforementioned regret guarantees of phased elimination and dynamic balancing. On the other hand, such a requirement on $\varepsilon$ is almost necessary to avoid the lower bound in Theorem 5.2: the proof of Theorem 5.2 shows that when $\varepsilon=\Omega(\sqrt{|\mathcal{A}|/T})$, for any algorithm there exist a conditionally benign environment $\nu$ and an approximate marginal $\tilde{\nu}(Z)$ such that $\tilde{\nu}(Z)$ and $\nu(Z)$ are $\varepsilon$-close, but the algorithm incurs $\Omega(\sqrt{|\mathcal{A}|T})$ regret on $\nu$ when it is given $\tilde{\nu}(Z)$ as input.
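
To see why a misspecification level of order $\sqrt{1/T}$ is enough, one can substitute $\varepsilon=c/\sqrt{T}$ into the misspecification term of Theorem 5.5, where $c>0$ is a hypothetical constant introduced only for this illustration. For any round $t\leq T$:

```latex
\begin{align*}
\varepsilon\, t\,\sqrt{d_{\tilde{\nu}}}\,\log T
  = c\,\frac{t}{\sqrt{T}}\,\sqrt{d_{\tilde{\nu}}}\,\log T
  = c\,\sqrt{t}\,\sqrt{\frac{t}{T}}\,\sqrt{d_{\tilde{\nu}}}\,\log T
  \le c\,\sqrt{d_{\tilde{\nu}}\,t}\,\log T .
\end{align*}
```

The misspecification term is therefore of the same order as the leading term $\sqrt{d_{\tilde{\nu}}\,t}$, up to logarithmic factors.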

It is worth mentioning that the $\sqrt{d_{\tilde{\nu}}}$ factor in the misspecification term cannot be improved in many regimes for linear bandit algorithms (Lattimore et al., 2020). However, $\operatorname{C-UCB}$ is able to shave off this factor (Bilodeau et al., 2022, Theorem 4.3) by utilizing realized contexts rather than the least-squares estimate of the mean reward vector $\mu^{\mathcal{Z}}$. From this perspective, we see that there is a price for pursuing better instance-dependent results by ignoring the context information.

6 Conclusions and Discussions

We provide a comprehensive characterization of the Pareto regret frontier for the bandit problem in the context of adapting to causal structure whenever feasible. We also give the first instance-dependent regret bound under conditionally benign environments, based on our novel causal-to-linear reduction. Finally, we show that the common assumption that we have access to the true marginals is necessary in general but can still be relaxed in some cases.

For future work, it would be important to focus on the design of algorithms that are easier to implement than running dynamic balancing over some base learners. On the theoretical side, it would be interesting to investigate other causal bandit scenarios involving adaptivity in light of our Pareto regret frontier. For example, we may define a series of “semi-benign” settings interpolating between conditionally benign environments and non-benign environments and study the Pareto regret frontier thereof.

Acknowledgements

ZL is supported by the Vector Research Grant at the Vector Institute. IA is supported by the Vatat Scholarship from the Israeli Council for Higher Education. DMR is supported by an NSERC Discovery Grant and funding through his Canada CIFAR AI Chair at the Vector Institute. The authors would like to thank Tomer Koren, Blair Bilodeau and Csaba Szepesvári for helpful discussions at different stages of this work.

References

  • Agarwal et al. (2017) Agarwal, A., Luo, H., Neyshabur, B., and Schapire, R. E. Corralling a band of bandit algorithms. In Conference on Learning Theory. PMLR, 2017.
  • Arora et al. (2021) Arora, R., Marinov, T. V., and Mohri, M. Corralling stochastic bandit algorithms. In International Conference on Artificial Intelligence and Statistics. PMLR, 2021.
  • Bareinboim et al. (2015) Bareinboim, E., Forney, A., and Pearl, J. Bandits with unobserved confounders: A causal approach. Advances in Neural Information Processing Systems, 28, 2015.
  • Bilodeau et al. (2022) Bilodeau, B., Wang, L., and Roy, D. Adaptively exploiting d-separators with causal bandits. Advances in Neural Information Processing Systems, 35, 2022.
  • Cutkosky et al. (2020) Cutkosky, A., Das, A., and Purohit, M. Upper confidence bounds for combining stochastic bandits. arXiv preprint arXiv:2012.13115, 2020.
  • Cutkosky et al. (2021) Cutkosky, A., Dann, C., Das, A., Gentile, C., Pacchiano, A., and Purohit, M. Dynamic balancing for model selection in bandits and RL. In International Conference on Machine Learning. PMLR, 2021.
  • Koolen (2013) Koolen, W. M. The Pareto regret frontier. Advances in Neural Information Processing Systems, 26, 2013.
  • Lattimore et al. (2016) Lattimore, F., Lattimore, T., and Reid, M. D. Causal bandits: Learning good interventions via causal inference. Advances in Neural Information Processing Systems, 29, 2016.
  • Lattimore (2015) Lattimore, T. The Pareto regret frontier for bandits. Advances in Neural Information Processing Systems, 28, 2015.
  • Lattimore & Szepesvári (2020) Lattimore, T. and Szepesvári, C. Bandit algorithms. Cambridge University Press, 2020.
  • Lattimore et al. (2020) Lattimore, T., Szepesvári, C., and Weisz, G. Learning with good feature representations in bandits and in RL with a generative model. In International Conference on Machine Learning. PMLR, 2020.
  • Lu et al. (2020) Lu, Y., Meisami, A., Tewari, A., and Yan, W. Regret analysis of bandit problems with causal background knowledge. In Conference on Uncertainty in Artificial Intelligence. PMLR, 2020.
  • Lu et al. (2021) Lu, Y., Meisami, A., and Tewari, A. Causal bandits with unknown graph structure. Advances in Neural Information Processing Systems, 34, 2021.
  • Malek et al. (2023) Malek, A., Aglietti, V., and Chiappa, S. Additive causal bandits with unknown graph. arXiv preprint arXiv:2306.07858, 2023.
  • Marinov & Zimmert (2021) Marinov, T. V. and Zimmert, J. The Pareto frontier of model selection for general contextual bandits. Advances in Neural Information Processing Systems, 34, 2021.
  • Nair et al. (2021) Nair, V., Patil, V., and Sinha, G. Budgeted and non-budgeted causal bandits. In International Conference on Artificial Intelligence and Statistics. PMLR, 2021.
  • Pacchiano et al. (2020a) Pacchiano, A., Dann, C., Gentile, C., and Bartlett, P. Regret bound balancing and elimination for model selection in bandits and RL. arXiv preprint arXiv:2012.13045, 2020a.
  • Pacchiano et al. (2020b) Pacchiano, A., Phan, M., Abbasi Yadkori, Y., Rao, A., Zimmert, J., Lattimore, T., and Szepesvári, C. Model selection in contextual stochastic bandit problems. Advances in Neural Information Processing Systems, 33, 2020b.
  • Sen et al. (2017) Sen, R., Shanmugam, K., Dimakis, A. G., and Shakkottai, S. Identifying best interventions through online importance sampling. In International Conference on Machine Learning. PMLR, 2017.
  • Xiong & Chen (2022) Xiong, N. and Chen, W. Pure exploration of causal bandits. arXiv preprint arXiv:2206.07883, 2022.
  • Zhu & Nowak (2022) Zhu, Y. and Nowak, R. Pareto optimal model selection in linear bandits. In International Conference on Artificial Intelligence and Statistics. PMLR, 2022.

Appendix A Regret Analysis for Dynamic Balancing

In this section we show that the regret guarantees of dynamic balancing in Cutkosky et al. (2021) can be generalized to our problem and provide a proof of our main upper bound Theorem 3.5.

Notations.

For base learner $i$, we use $\mathrm{CandidReg}_{i}(t)$ to denote its candidate anytime regret bound that is expected to hold in its favorable settings. Throughout, we consider $\mathrm{CandidReg}_{i}(t)$ of the form $d_{i}\sqrt{t}$, where $d_{i}$ implicitly depends on the confidence parameter $\delta$. Let $i_{t}$ be the index of the base learner selected in round $t$. $U_{i}(t)=\sum_{s=1}^{t}Y_{s}\mathbb{I}\left\{i=i_{s}\right\}$ is the observed cumulative reward in the first $t$ rounds where $i$ is picked, and $n_{i}(t)=\sum_{s=1}^{t}\mathbb{I}\left\{i=i_{s}\right\}$ is the number of rounds in which $i$ is picked by the end of round $t$.
The local regret of $i$ up to round $t$ is $\mathrm{Reg}_{i}(t)=n_{i}(t)\mu^{\mathcal{A}}(a^{*})-U_{i}(t)$. We say learner $i$ is well-specified if $\mathrm{Reg}_{i}(t)\leq\mathrm{CandidReg}_{i}(n_{i}(t))=d_{i}\sqrt{n_{i}(t)}$ for all $t\in[T]$, and otherwise it is misspecified. We use $i_{\star}$ to denote any well-specified learner.

A.1 Preliminaries

Roughly speaking, in each round $t$, dynamic balancing works by (1) running a misspecification test to temporarily de-activate misspecified base learners and (2) picking the learner $i_t$ with minimal putative regret $d_i\sqrt{n_i(t)}$ among all active learners $i$ in this round. In this way, the regret incurred by $\mathrm{DB}$ is comparable to that of the best well-specified learner.
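As a rough illustration of step (2) only, the selection rule can be sketched as follows. This is a minimal sketch under stated assumptions, not the full algorithm of Cutkosky et al. (2021): the misspecification test of step (1) is not implemented, the set `active` is assumed to be supplied by the caller, and the function name `select_learner` and the numerical values are hypothetical.

```python
import math

def select_learner(d, n, active):
    """Pick the active learner with the smallest putative regret d_i * sqrt(n_i).

    d      -- dict mapping learner index i to its coefficient d_i (assumed given)
    n      -- dict mapping learner index i to its play count n_i(t)
    active -- iterable of learner indices that passed the misspecification
              test (the test itself is NOT implemented in this sketch)
    """
    candidates = sorted(active)  # sorted so that ties break deterministically
    if not candidates:
        raise ValueError("at least one learner must be active")
    return min(candidates, key=lambda i: d[i] * math.sqrt(n[i]))

# Hypothetical values: learner 1 has putative regret 2 * sqrt(16) = 8,
# learner 2 has putative regret 5 * sqrt(4) = 10.
d = {1: 2.0, 2: 5.0}
n = {1: 16, 2: 4}
chosen = select_learner(d, n, active={1, 2})
```

Because the rule always plays the learner whose putative regret is currently smallest, the putative regrets of the active learners stay roughly balanced over time, which is the intuition behind the name.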

Notice that dynamic balancing was originally introduced for stochastic contextual bandits (where contexts are revealed prior to actions) in Cutkosky et al. (2021). To see that $\mathrm{DB}$ can also be applied in stochastic bandits with post-action contexts, it is worth identifying several important features of $\mathrm{DB}$:

  1. First of all, the meta decision made by $\mathrm{DB}$ in each round $t$ depends only on the global information, i.e., $U_i(t)$ and $n_i(t)$ (as well as the user-specified $d_i$, $b_i$, and $v_i$). In particular, it does not need any information regarding context variables or internal states of base learners.

  2. Second, $\mathrm{DB}$ only updates the selected base learner $i_t$ in each round $t$, and the update only uses the reward and contextual information observed in this round, where the context can be pre-action, post-action, or both. Thus the regret guarantees of $\mathrm{DB}$ hold regardless of the nature of the contexts, provided that the internal updates of base learners are not affected.

Therefore, by the above observations, dynamic balancing does not depend on which kind of (stochastic) contextual information is observed in the underlying (stochastic) bandit problem.

Now we state the worst-case regret bound of $\mathrm{DB}$ in Cutkosky et al. (2021) adapted to our setting. First define the good event

$$\mathcal{E}(\delta)=\left\{\forall i\in\{1,2\},\ \forall t\in[T]:\ \left|n_i(t)\mu^{\mathcal{A}}(a^*)-U_i(t)-\mathrm{Reg}_i(t)\right|\leq c\sqrt{n_i(t)\log\left(\frac{2\log n_i(t)}{\delta}\right)}\right\}$$

on which we are able to control the regret of $\mathrm{DB}$. According to the analysis of Cutkosky et al. (2021, Lemma 5), we can fix $c$ to be some absolute constant (which can in fact be set to $3$ in our setting) such that $\mathbb{P}_{\nu,\pi}[\mathcal{E}(\delta)]\geq 1-\delta$ for any $\nu\in\mathcal{P}(\mathcal{Z}\times\mathcal{Y})^{\mathcal{A}}$ and $\pi\in\Pi(\mathcal{A},\mathcal{Z},T)$. Conditioning on $\mathcal{E}(\delta)$, we have the following regret bound:

Proposition A.1 (Adapted version of Theorem 22 in Cutkosky et al. (2021)).

Let $\mathfrak{a}_1$ be a $d_1$-benign family and let $\mathfrak{a}_2$ be a $d_2$-arbitrary family of learners. Let $Z_1,Z_2$ be arbitrary positive real numbers. For all $\delta\in(0,1)$, we can set hyper-parameters

$$b_i(t)=\max\Bigl\{\frac{2Z_i}{\sqrt{t}},\frac{3\sqrt{2\log(2\log t/\delta)}}{\sqrt{t}}\Bigr\},\quad v_i=\sqrt{\frac{Z_i}{d_i(\delta)^{3}}}$$

in dynamic balancing such that the policy $\mathrm{DB}(\delta)$ given by dynamic balancing with $\mathfrak{a}_1,\mathfrak{a}_2,d_1,d_2$ satisfies the following: for all instances $\nu$, conditioning on $\mathcal{E}(\delta)$ and the existence of a well-specified base learner $i_\star$, the regret of $\mathrm{DB}(\delta)$ is bounded by

$$\mathrm{Reg}(T)\leq\mathrm{Reg}_{i_\star}(T)+C'\left(\sqrt{\log\left(\frac{\log T}{\delta}\right)}+Z_{i_\star}d_{i_\star}(\delta)+\sum_{i\neq i_\star}\frac{d_i(\delta)}{Z_i}\right)\sqrt{T},$$

where $C'>0$ is a universal constant.

It is straightforward to see that Proposition A.1 is obtained by taking $M=2$, $C=1$, $c=3$, $\beta=1/2$, and $W_1=W_2=\sqrt{2}$ in Cutkosky et al. (2021, Theorem 22).
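To make the hyper-parameter choice in Proposition A.1 concrete, the following minimal sketch evaluates $b_i(t)$ and $v_i$ exactly as written in the display above. The function names `b` and `v` and the restriction to $t\geq 2$ (so that $\log t>0$) are assumptions of this sketch; it does not reproduce any code from Cutkosky et al. (2021).

```python
import math

def b(t, Z_i, delta):
    """b_i(t) = max{ 2 Z_i / sqrt(t), 3 sqrt(2 log(2 log t / delta)) / sqrt(t) }.

    Assumes t >= 2 and 0 < delta < 1, so that 2 * log(t) / delta > 1 and the
    outer logarithm is positive.
    """
    if t < 2 or not (0.0 < delta < 1.0):
        raise ValueError("this sketch assumes t >= 2 and 0 < delta < 1")
    concentration = 3.0 * math.sqrt(2.0 * math.log(2.0 * math.log(t) / delta))
    return max(2.0 * Z_i, concentration) / math.sqrt(t)

def v(Z_i, d_i):
    """v_i = sqrt(Z_i / d_i^3), with Z_i > 0 and d_i > 0."""
    if Z_i <= 0.0 or d_i <= 0.0:
        raise ValueError("Z_i and d_i must be positive")
    return math.sqrt(Z_i / d_i ** 3)
```

Both terms inside the maximum decay like $1/\sqrt{t}$ up to a doubly logarithmic factor, so a larger $Z_i$ only changes $b_i(t)$ once $2Z_i$ exceeds the concentration term.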

A.2 Proof of Theorem 3.5

Theorem 3.5 is an immediate consequence of the following regret bound, which is derived by instantiating $Z_1,Z_2$ in Proposition A.1 with specific values.

Proposition A.2.

For every pair of reasonable rate functions $R_1(T),R_2(T)$ such that $R_1(T)R_2(T)\geq|\mathcal{A}|T$, we can instantiate Proposition A.1 with $Z_1=1$, $Z_2=\frac{R_2(T)}{\sqrt{|\mathcal{A}|T}}$ such that for all $\delta\in(0,1)$, the policy $\mathrm{DB}(\delta)$ with the same setup as Proposition A.1 satisfies the following: for all instances $\nu$, with probability at least $1-O(\delta)$, the regret of $\mathrm{DB}(\delta)$ is bounded by

$$\begin{aligned}
\mathrm{Reg}(T)&\leq C'\left(d_1(\delta)+\sqrt{\log\left(\frac{\log T}{\delta}\right)}+\frac{d_2(\delta)}{\sqrt{|\mathcal{A}|T}}R_1(T)\right)\sqrt{T},\quad\text{if $\nu$ is conditionally benign;}\\
\mathrm{Reg}(T)&\leq C'\left(d_1(\delta)+\sqrt{\log\left(\frac{\log T}{\delta}\right)}+\frac{d_2(\delta)}{\sqrt{|\mathcal{A}|T}}R_2(T)\right)\sqrt{T},\quad\text{if $\nu$ is arbitrary,}
\end{aligned}$$

where $C'>0$ is a universal constant.

Now we can see that our main upper bound, Theorem 3.5, follows immediately by taking $d_1(\delta)=O\left((\sqrt{|\mathcal{Z}|}+\sqrt{\log(T/\delta)})\sqrt{\log(|\mathcal{Z}|T/\delta)}\right)$, $d_2(\delta)=O(\sqrt{|\mathcal{A}|\log(|\mathcal{A}|T/\delta)})$, and $\delta=1/T$.

Proof of Proposition A.2.

By Definition 3.3, we know that for all conditionally benign instances $\nu$, with probability at least $1-O(\delta)$, learner $\mathfrak{a}_1$ is well-specified with $\mathrm{CandidReg}_1(t)=d_1(\delta)\sqrt{t}$ and the regret bound in Proposition A.1 holds with $i_\star=1$. Plugging in $Z_1=1$, $Z_2=\frac{R_2(T)}{\sqrt{|\mathcal{A}|T}}$, the regret of $\mathrm{DB}(\delta)$ is bounded by

$$\mathrm{Reg}(T)\leq C'\left(2d_1(\delta)+\sqrt{\log\left(\frac{\log T}{\delta}\right)}+d_2(\delta)\frac{\sqrt{|\mathcal{A}|T}}{R_2(T)}\right)\sqrt{T}.$$

Similarly, for all instances $\nu$, with probability at least $1-O(\delta)$, learner $\mathfrak{a}_2$ is well-specified with $\mathrm{CandidReg}_2(t)=d_2(\delta)\sqrt{t}$ and the regret bound in Proposition A.1 holds with $i_\star=2$, which is

$$\mathrm{Reg}(T)\leq C'\left(d_2(\delta)+\sqrt{\log\left(\frac{\log T}{\delta}\right)}+d_2(\delta)\frac{R_2(T)}{\sqrt{|\mathcal{A}|T}}+d_1(\delta)\right)\sqrt{T}.$$

By our assumption that $(R_1(T),R_2(T))$ is reasonable and $R_1(T)R_2(T)\geq|\mathcal{A}|T$, we have that $R_2(T)\geq\sqrt{|\mathcal{A}|T}$ and $\frac{|\mathcal{A}|T}{R_2(T)}\leq R_1(T)$. Hence the regret of $\mathrm{DB}(\delta)$ for all instances $\nu$ is further bounded by

$$\begin{aligned}
\mathrm{Reg}(T)&\leq C'\left(d_1(\delta)+\sqrt{\log\left(\frac{\log T}{\delta}\right)}+\frac{d_2(\delta)}{\sqrt{|\mathcal{A}|T}}R_1(T)\right)\sqrt{T},\quad\text{if $\nu$ is conditionally benign;}\\
\mathrm{Reg}(T)&\leq C'\left(d_1(\delta)+\sqrt{\log\left(\frac{\log T}{\delta}\right)}+\frac{d_2(\delta)}{\sqrt{|\mathcal{A}|T}}R_2(T)\right)\sqrt{T},\quad\text{if $\nu$ is arbitrary,}
\end{aligned}$$

which completes the proof. ∎
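For readers who wish to verify the last step of the proof above, the two elementary inequalities it relies on can be spelled out as follows. This is our own worked sketch, using only the assumptions $R_1(T)R_2(T)\geq|\mathcal{A}|T$ and $R_2(T)\geq\sqrt{|\mathcal{A}|T}$ stated in the proof.

```latex
% Benign case: rewrite the d_2 term and apply |A|T / R_2(T) <= R_1(T).
\frac{\sqrt{|\mathcal{A}|T}}{R_2(T)}
  = \frac{1}{\sqrt{|\mathcal{A}|T}} \cdot \frac{|\mathcal{A}|T}{R_2(T)}
  \leq \frac{R_1(T)}{\sqrt{|\mathcal{A}|T}}.

% Arbitrary case: the stand-alone d_2 term is dominated because
% R_2(T) >= sqrt(|A|T).
d_2(\delta) \leq d_2(\delta) \, \frac{R_2(T)}{\sqrt{|\mathcal{A}|T}}.
```

The remaining constant factors (the $2$ in front of $d_1(\delta)$ in the benign case, and the two $d_2(\delta)$ terms in the arbitrary case) are absorbed by enlarging the universal constant $C'$.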

Appendix B Regret analysis of phased elimination

In this section we prove Theorem 4.3 and Theorem 5.5, while Theorem 4.1 is implied by taking $\varepsilon=0$ in Theorem 5.5. Recall that the proof of Theorem 5.5 is based on the analysis of phased elimination in Lattimore et al. (2020, Proposition 5.1). For simplicity we use $\mathbb{P}$ and $\mathbb{E}$ to denote the probability and expectation operators determined jointly by the underlying conditionally benign environment $\nu$ and the phased elimination algorithm. We also use $\Delta_a,\Delta_{\min}$ to denote the true sub-optimality gap $\Delta_a(\nu)=\mu^{\mathcal{A}}(a^*)-\mu^{\mathcal{A}}(a)$ and the minimal sub-optimality gap $\Delta_{\min}(\nu)=\min_{a\in\mathcal{A}}\mu^{\mathcal{A}}(a^*)-\mu^{\mathcal{A}}(a)$, respectively, with regard to the underlying instance $\nu$.

B.1 Prerequisite

Lemma B.1.

(In-phase concentration) For any phase $\ell$, let

$$E^{\mathrm{phase}}_{\ell}(\delta)=\left\{|\langle\hat{\mu}^{\mathcal{Z}}_{\ell}-\mu^{\mathcal{Z}},\tilde{\nu}_a\rangle|\leq 2\varepsilon\sqrt{d_{\tilde{\nu}}}+\sqrt{\frac{4d_{\tilde{\nu}}}{m_{\ell}}\log\left(\frac{2|\mathcal{A}|\log_2(T)}{\delta}\right)},\ \forall a\in\mathcal{A}_{\ell}\right\}$$

and $\mathscr{F}_{\ell}$ be the $\sigma$-algebra generated by the history up to the start of phase $\ell$. Then $\mathbb{P}[E^{\mathrm{phase}}_{\ell}(\delta)\,|\,\mathscr{F}_{\ell}]\geq 1-\frac{\delta}{\log_2(T)}$.
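As a small numerical aid, the confidence width appearing in $E^{\mathrm{phase}}_{\ell}(\delta)$ can be evaluated directly. The sketch below simply transcribes the right-hand side of the inequality in Lemma B.1; the function name `phase_confidence_width` and its argument names are hypothetical, and the inputs are assumed to satisfy $T\geq 2$, $m_\ell>0$, and $\delta\in(0,1)$.

```python
import math

def phase_confidence_width(eps, d, m, num_actions, T, delta):
    """Return 2*eps*sqrt(d) + sqrt((4*d/m) * log(2*|A|*log2(T)/delta)).

    eps         -- accuracy of the estimated marginals (epsilon in the text)
    d           -- the dimension-like quantity written d_{nu-tilde} in the text
    m           -- the phase length parameter m_ell
    num_actions -- |A|, the number of actions
    T           -- horizon (assumed >= 2 so that log2(T) >= 1)
    delta       -- confidence parameter in (0, 1)
    """
    if T < 2 or m <= 0 or not (0.0 < delta < 1.0):
        raise ValueError("this sketch assumes T >= 2, m > 0 and 0 < delta < 1")
    log_term = math.log(2.0 * num_actions * math.log2(T) / delta)
    bias = 2.0 * eps * math.sqrt(d)                 # does not shrink with m
    noise = math.sqrt(4.0 * d / m * log_term)       # shrinks like 1/sqrt(m)
    return bias + noise
```

The first term is a bias floor caused by inaccurate marginals and does not vanish as $m_\ell$ grows, while the second term decays at the usual $1/\sqrt{m_\ell}$ rate.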

Proof of Lemma B.1.

Let $b_a=\langle\nu_a-\tilde{\nu}_a,\mu^{\mathcal{Z}}\rangle$ for all $a\in\mathcal{A}$ be the error term due to the use of inaccurate marginals. Then we know that $|b_a|\leq\varepsilon$ for all $a\in\mathcal{A}$, since $\nu(Z)$ and $\tilde{\nu}(Z)$ are $\varepsilon$-close. Observe that

$$\begin{aligned}
\langle\hat{\mu}^{\mathcal{Z}}_{\ell}-\mu^{\mathcal{Z}},\tilde{\nu}_a\rangle
&=\Big\langle V_{\ell}^{-1}\sum_{t\in\text{phase }\ell}\tilde{\nu}_{A_t}\tilde{\nu}_{A_t}^{\top}\mu^{\mathcal{Z}},\tilde{\nu}_a\Big\rangle-\langle\mu^{\mathcal{Z}},\tilde{\nu}_a\rangle\\
&\quad+\Big\langle V_{\ell}^{-1}\sum_{t\in\text{phase }\ell}\tilde{\nu}_{A_t}\eta^{\mathcal{A}}_t,\tilde{\nu}_a\Big\rangle+\Big\langle V_{\ell}^{-1}\sum_{t\in\text{phase }\ell}\tilde{\nu}_{A_t}b_{A_t},\tilde{\nu}_a\Big\rangle\\
&=\sum_{t\in\text{phase }\ell}\langle V_{\ell}^{-1}\tilde{\nu}_{A_t},\tilde{\nu}_a\rangle\eta^{\mathcal{A}}_t+\sum_{t\in\text{phase }\ell}\langle V_{\ell}^{-1}\tilde{\nu}_{A_t},\tilde{\nu}_a\rangle b_{A_t}.
\end{aligned}$$

Using the Cauchy-Schwarz inequality and the fact that for all $a\in\mathcal{A}_{\ell}$, $\|\tilde{\nu}_{a}\|_{V_{\ell}^{-1}}^{2}\leq\frac{1}{m_{\ell}}\|\tilde{\nu}_{a}\|_{V(\pi_{\ell})^{-1}}^{2}\leq\frac{2d_{\tilde{\nu}}}{m_{\ell}}$, the second term on the RHS of the above equality can be bounded by

\begin{align*}
\left|\sum_{t\in\text{phase }\ell}\langle V_{\ell}^{-1}\tilde{\nu}_{A_{t}},\tilde{\nu}_{a}\rangle b_{A_{t}}\right|
&\leq\varepsilon\sum_{t\in\text{phase }\ell}\left|\langle V_{\ell}^{-1}\tilde{\nu}_{A_{t}},\tilde{\nu}_{a}\rangle\right|\\
&\leq\varepsilon\sqrt{\left(\sum_{t\in\text{phase }\ell}1\right)\left(\sum_{t\in\text{phase }\ell}\langle V_{\ell}^{-1}\tilde{\nu}_{A_{t}},\tilde{\nu}_{a}\rangle^{2}\right)}\\
&=\varepsilon\sqrt{T_{\ell}\|\tilde{\nu}_{a}\|_{V_{\ell}^{-1}}^{2}}\leq\varepsilon\sqrt{2m_{\ell}\frac{2d_{\tilde{\nu}}}{m_{\ell}}}=2\varepsilon\sqrt{d_{\tilde{\nu}}}.
\end{align*}
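The equality in the last line relies on a step that may be worth spelling out. The following is a sketch under two assumptions that are not restated in this passage: that the design matrix is $V_{\ell}=\sum_{t\in\text{phase }\ell}\tilde{\nu}_{A_{t}}\tilde{\nu}_{A_{t}}^{\top}$ (symmetric and invertible), and that $T_{\ell}$ denotes the length of phase $\ell$ with $T_{\ell}\leq 2m_{\ell}$. Under these assumptions the sum of squares collapses to a weighted norm:

\begin{align*}
\sum_{t\in\text{phase }\ell}\langle V_{\ell}^{-1}\tilde{\nu}_{A_{t}},\tilde{\nu}_{a}\rangle^{2}
&=\tilde{\nu}_{a}^{\top}V_{\ell}^{-1}\left(\sum_{t\in\text{phase }\ell}\tilde{\nu}_{A_{t}}\tilde{\nu}_{A_{t}}^{\top}\right)V_{\ell}^{-1}\tilde{\nu}_{a}\\
&=\tilde{\nu}_{a}^{\top}V_{\ell}^{-1}V_{\ell}V_{\ell}^{-1}\tilde{\nu}_{a}
=\|\tilde{\nu}_{a}\|_{V_{\ell}^{-1}}^{2}.
\end{align*}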

To bound the first term, notice that $(A_{t})_{t\in\text{phase }\ell}$ and $V_{\ell}$ are fixed given the history prior to the start of phase $\ell$. Hence $(\eta^{\mathcal{A}}_{t})_{t\in\text{phase }\ell}$ are independent conditioned on $\mathscr{F}_{\ell}$ and bounded in $[-1,1]$. By standard concentration bounds, we have that with probability at least $1-\frac{\delta}{|\mathcal{A}|\log_{2}(T)}$,

\begin{align*}
\left|\sum_{t\in\text{phase }\ell}\langle V_{\ell}^{-1}\tilde{\nu}_{A_{t}},\tilde{\nu}_{a}\rangle\eta^{\mathcal{A}}_{t}\right|
&\leq\sqrt{2\sum_{t\in\text{phase }\ell}\langle V_{\ell}^{-1}\tilde{\nu}_{A_{t}},\tilde{\nu}_{a}\rangle^{2}\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)},
\end{align*}

where the RHS can be rewritten as

\begin{align*}
\sqrt{2\|\tilde{\nu}_{a}\|_{V_{\ell}^{-1}}^{2}\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}\leq\sqrt{\frac{4d_{\tilde{\nu}}}{m_{\ell}}\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}.
\end{align*}

Combining the two upper bounds above and taking a union bound over all $a\in\mathcal{A}_{\ell}$, we have that with probability at least $1-\frac{\delta}{\log_{2}(T)}$,

\begin{align*}
|\langle\hat{\mu}^{\mathcal{Z}}_{\ell}-\mu^{\mathcal{Z}},\tilde{\nu}_{a}\rangle|\leq 2\varepsilon\sqrt{d_{\tilde{\nu}}}+\sqrt{\frac{4d_{\tilde{\nu}}}{m_{\ell}}\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)},\quad\forall a\in\mathcal{A}_{\ell},
\end{align*}

which finishes the proof. ∎

Since the marginal distributions $\tilde{\nu}_{a}$ are possibly inaccurate, we may not be able to show that the optimal action $a^{*}$ is never eliminated with high probability. What we can hope for instead is that actions that are near-optimal relative to the best action in $\mathcal{A}_{\ell}$ are retained at the end of phase $\ell$. To be concrete, define $a^{*}_{\ell}\in\operatorname*{arg\!\min}_{a\in\mathcal{A}_{\ell}}\Delta_{a}$ to be the true optimal action within $\mathcal{A}_{\ell}$. Then we can show that $\Delta_{a}-\Delta_{a^{*}_{\ell}}$ is small for any $a$ that is not eliminated at the end of phase $\ell$.

Lemma B.2.

Conditioning on the event $E^{\mathrm{phase}}_{\ell}(\delta)$, any action $a$ not eliminated at the end of phase $\ell$ has relative sub-optimality gap $\langle\mu^{\mathcal{Z}},\nu_{a^{*}_{\ell}}-\nu_{a}\rangle=\Delta_{a}-\Delta_{a^{*}_{\ell}}\leq 2\varepsilon(1+2\sqrt{d_{\tilde{\nu}}})+4\sqrt{\frac{4d_{\tilde{\nu}}}{m_{\ell}}\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}$.

Proof of Lemma B.2.

According to the rule for updating the active set, whenever $a\in\mathcal{A}_{\ell}$ is not eliminated at the end of phase $\ell$, it holds that

\begin{align*}
\langle\hat{\mu}^{\mathcal{Z}}_{\ell},\tilde{\nu}_{a^{*}_{\ell}}-\tilde{\nu}_{a}\rangle\leq\max_{b\in\mathcal{A}_{\ell}}\langle\hat{\mu}^{\mathcal{Z}}_{\ell},\tilde{\nu}_{b}-\tilde{\nu}_{a}\rangle\leq 2\sqrt{\frac{4d_{\tilde{\nu}}}{m_{\ell}}\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}.
\end{align*}

This implies that

\begin{align*}
\langle\mu^{\mathcal{Z}},\tilde{\nu}_{a^{*}_{\ell}}-\tilde{\nu}_{a}\rangle
&=\langle\mu^{\mathcal{Z}}-\hat{\mu}^{\mathcal{Z}}_{\ell},\tilde{\nu}_{a^{*}_{\ell}}-\tilde{\nu}_{a}\rangle+\langle\hat{\mu}^{\mathcal{Z}}_{\ell},\tilde{\nu}_{a^{*}_{\ell}}-\tilde{\nu}_{a}\rangle\\
&\leq 2\Big(\sqrt{\frac{4d_{\tilde{\nu}}}{m_{\ell}}\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}+2\varepsilon\sqrt{d_{\tilde{\nu}}}\Big)+2\sqrt{\frac{4d_{\tilde{\nu}}}{m_{\ell}}\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}\\
&=4\sqrt{\frac{4d_{\tilde{\nu}}}{m_{\ell}}\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}+4\varepsilon\sqrt{d_{\tilde{\nu}}},
\end{align*}

where the inequality uses the fact that we are conditioning on $E^{\mathrm{phase}}_{\ell}(\delta)$, applied to both $a^{*}_{\ell}$ and $a$. By the $\varepsilon$-closeness between $\tilde{\nu}$ and $\nu$, passing to the true marginals $\nu$ costs at most an additional $2\varepsilon$, hence

\begin{align*}
\langle\mu^{\mathcal{Z}},\nu_{a^{*}_{\ell}}-\nu_{a}\rangle\leq 4\sqrt{\frac{4d_{\tilde{\nu}}}{m_{\ell}}\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}+4\varepsilon\sqrt{d_{\tilde{\nu}}}+2\varepsilon.
\end{align*}
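The elimination test invoked at the start of the proof above can be illustrated with a small sketch. The function names (`radius`, `surviving_actions`), the argument names, and the toy numbers are illustrative assumptions and not part of the paper's algorithm; only the threshold $2\sqrt{\frac{4d_{\tilde{\nu}}}{m_{\ell}}\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}$ is taken from the text.

```python
import math


def radius(d, m, num_actions, horizon, delta):
    """Statistical width sqrt(4 d / m * log(2 |A| log2(T) / delta))."""
    log_term = math.log(2.0 * num_actions * math.log2(horizon) / delta)
    return math.sqrt(4.0 * d / m * log_term)


def inner(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def surviving_actions(mu_hat, nu_tilde, active, d, m, num_actions, horizon, delta):
    """Keep action a iff max_b <mu_hat, nu_tilde[b] - nu_tilde[a]> <= 2 * radius."""
    threshold = 2.0 * radius(d, m, num_actions, horizon, delta)
    values = {a: inner(mu_hat, nu_tilde[a]) for a in active}
    best = max(values.values())
    return [a for a in active if best - values[a] <= threshold]


# Toy instance (hypothetical numbers): two contexts, three actions.
mu_hat = [1.0, 0.0]
nu_tilde = {0: [0.9, 0.1], 1: [0.8, 0.2], 2: [0.1, 0.9]}
kept = surviving_actions(mu_hat, nu_tilde, [0, 1, 2],
                         d=2, m=1000, num_actions=3, horizon=10000, delta=0.05)
print(kept)  # [0, 1]
```

In this sketch the empirically best action always survives, since its empirical gap is zero; an action is dropped only when its empirical gap exceeds the threshold.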

Now we need to track $\Delta_{a^{*}_{\ell}}$, the sub-optimality of the best active action in each phase. Observe that $\Delta_{a^{*}_{\ell}}=\sum_{k=1}^{\ell-1}(\Delta_{a^{*}_{k+1}}-\Delta_{a^{*}_{k}})$ since $\Delta_{a^{*}_{1}}=\Delta_{a^{*}}=0$. It therefore suffices to bound each increment $\Delta_{a^{*}_{k+1}}-\Delta_{a^{*}_{k}}$, $k\leq\ell-1$, in order to control the growth of $\Delta_{a^{*}_{\ell}}$.

Lemma B.3.

Conditioning on the event $E^{\mathrm{phase}}_{\ell}(\delta)$, we have $\Delta_{a^{*}_{\ell+1}}-\Delta_{a^{*}_{\ell}}\leq 2\varepsilon(1+2\sqrt{d_{\tilde{\nu}}})$.

Proof of Lemma B.3.

Suppose $E^{\mathrm{phase}}_{\ell}(\delta)$ happens. Notice that the result holds trivially if $a^{*}_{\ell}$ is not eliminated at the end of phase $\ell$, because in this case $a^{*}_{\ell+1}=a^{*}_{\ell}$. On the other hand, if $a^{*}_{\ell}$ is eliminated, define $\hat{a}_{\ell}\in\operatorname*{arg\!\max}_{a\in\mathcal{A}_{\ell}}\langle\hat{\mu}^{\mathcal{Z}}_{\ell},\tilde{\nu}_{a}\rangle$ to be the empirically best action at the end of phase $\ell$. Then we have

\begin{align*}
\langle\hat{\mu}^{\mathcal{Z}}_{\ell},\tilde{\nu}_{\hat{a}_{\ell}}-\tilde{\nu}_{a^{*}_{\ell}}\rangle>2\sqrt{\frac{4d_{\tilde{\nu}}}{m_{\ell}}\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)},
\end{align*}

according to the test performed. In the meantime, recall that due to in-phase concentration and the $\varepsilon$-closeness between $\tilde{\nu}$ and $\nu$,

\begin{align*}
\langle\hat{\mu}^{\mathcal{Z}}_{\ell},\tilde{\nu}_{\hat{a}_{\ell}}-\tilde{\nu}_{a^{*}_{\ell}}\rangle
&\leq\langle\mu^{\mathcal{Z}},\tilde{\nu}_{\hat{a}_{\ell}}-\tilde{\nu}_{a^{*}_{\ell}}\rangle+4\varepsilon\sqrt{d_{\tilde{\nu}}}+2\sqrt{\frac{4d_{\tilde{\nu}}}{m_{\ell}}\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}\\
&\leq\langle\mu^{\mathcal{Z}},\nu_{\hat{a}_{\ell}}-\nu_{a^{*}_{\ell}}\rangle+2\varepsilon+4\varepsilon\sqrt{d_{\tilde{\nu}}}+2\sqrt{\frac{4d_{\tilde{\nu}}}{m_{\ell}}\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}.
\end{align*}

Hence we get

\begin{align*}
\Delta_{\hat{a}_{\ell}}-\Delta_{a^{*}_{\ell}}=\langle\mu^{\mathcal{Z}},\nu_{a^{*}_{\ell}}-\nu_{\hat{a}_{\ell}}\rangle\leq 2\varepsilon+4\varepsilon\sqrt{d_{\tilde{\nu}}}
\end{align*}

and

\begin{align*}
\Delta_{a^{*}_{\ell+1}}-\Delta_{a^{*}_{\ell}}\leq\Delta_{\hat{a}_{\ell}}-\Delta_{a^{*}_{\ell}}\leq 2\varepsilon+4\varepsilon\sqrt{d_{\tilde{\nu}}},
\end{align*}

where the first inequality holds because $\hat{a}_{\ell}$, as the empirical maximizer, passes the elimination test and therefore remains in $\mathcal{A}_{\ell+1}$, over which $a^{*}_{\ell+1}$ minimizes $\Delta_{a}$.
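For completeness, the bound on $\Delta_{\hat{a}_{\ell}}-\Delta_{a^{*}_{\ell}}$ can be obtained by chaining the elimination test with the concentration bound. Writing $\beta_{\ell}=\sqrt{\frac{4d_{\tilde{\nu}}}{m_{\ell}}\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}$ (a shorthand introduced here for readability, not notation from the original), one possible way to write the step is:

\begin{align*}
2\beta_{\ell}
&<\langle\hat{\mu}^{\mathcal{Z}}_{\ell},\tilde{\nu}_{\hat{a}_{\ell}}-\tilde{\nu}_{a^{*}_{\ell}}\rangle
\leq\langle\mu^{\mathcal{Z}},\nu_{\hat{a}_{\ell}}-\nu_{a^{*}_{\ell}}\rangle+2\varepsilon+4\varepsilon\sqrt{d_{\tilde{\nu}}}+2\beta_{\ell},\\
\text{so}\quad
\langle\mu^{\mathcal{Z}},\nu_{a^{*}_{\ell}}-\nu_{\hat{a}_{\ell}}\rangle
&<2\varepsilon+4\varepsilon\sqrt{d_{\tilde{\nu}}}.
\end{align*}

The statistical term $2\beta_{\ell}$ cancels, which is why the per-phase drift depends only on $\varepsilon$ and $d_{\tilde{\nu}}$ and not on the phase length $m_{\ell}$.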

Corollary B.4.

For any $\ell\geq 2$ and conditioning on $\bigcap_{k\leq\ell-1}E^{\mathrm{phase}}_{k}(\delta)$, we have that $\Delta_{a^{*}_{\ell}}\leq 2\varepsilon(\ell-1)(1+2\sqrt{d_{\tilde{\nu}}})$ and $\Delta_{a}\leq 2\varepsilon(1+2\sqrt{d_{\tilde{\nu}}})+4\sqrt{\frac{4d_{\tilde{\nu}}}{m_{\ell-1}}\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}+2\varepsilon(\ell-2)(1+2\sqrt{d_{\tilde{\nu}}})$ for all $a\in\mathcal{A}_{\ell}$.

Proof of Corollary B.4.

By conditioning on the intersection of all $E^{\mathrm{phase}}_{k}(\delta)$, $k\leq\ell-1$, we have that

\begin{align*}
\Delta_{a^{*}_{k+1}}-\Delta_{a^{*}_{k}}\leq 2\varepsilon(1+2\sqrt{d_{\tilde{\nu}}}),\quad\forall k\leq\ell-1,
\end{align*}

which implies that $\Delta_{a^{*}_{s}}=\sum_{k=1}^{s-1}(\Delta_{a^{*}_{k+1}}-\Delta_{a^{*}_{k}})\leq 2\varepsilon(s-1)(1+2\sqrt{d_{\tilde{\nu}}})$ for all $s\leq\ell$. In particular,

$$\Delta_{a^{*}_{\ell}}\leq 2\varepsilon(\ell-1)(1+2\sqrt{d_{\tilde{\nu}}}).$$

Since every action $a\in\mathcal{A}_{\ell}$ passes the test at the end of the $(\ell-1)$-th phase and hence is not eliminated, Lemma B.2 gives

$$\Delta_{a}-\Delta_{a^{*}_{\ell-1}}\leq 2\varepsilon(1+2\sqrt{d_{\tilde{\nu}}})+4\sqrt{\frac{4d_{\tilde{\nu}}}{m_{\ell-1}}\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}.$$

Therefore, for all $a\in\mathcal{A}_{\ell}$,

$$\Delta_{a}=\Delta_{a}-\Delta_{a^{*}_{\ell-1}}+\Delta_{a^{*}_{\ell-1}}\leq 2\varepsilon(1+2\sqrt{d_{\tilde{\nu}}})+4\sqrt{\frac{4d_{\tilde{\nu}}}{m_{\ell-1}}\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}+2\varepsilon(\ell-2)(1+2\sqrt{d_{\tilde{\nu}}}).$$
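As a quick sanity check on how the two bounds of Corollary B.4 relate, the following minimal sketch evaluates them numerically. The function names and the parameter values are illustrative assumptions, not part of the paper. It shows that the two $\varepsilon$ terms in the bound on $\Delta_{a}$ combine to $2\varepsilon(\ell-1)(1+2\sqrt{d_{\tilde{\nu}}})$, so the bound for an arbitrary active action exceeds the bound for $a^{*}_{\ell}$ by exactly the statistical width.

```python
import math


def best_active_gap_bound(eps, ell, d):
    """Corollary B.4 bound on the gap of the best active action in phase ell >= 2."""
    return 2 * eps * (ell - 1) * (1 + 2 * math.sqrt(d))


def active_gap_bound(eps, ell, d, m_prev, n_actions, horizon, delta):
    """Corollary B.4 bound on the gap of any action still active in phase ell >= 2.

    m_prev stands for m_{ell-1}; all argument names are hypothetical.
    """
    log_term = math.log(2 * n_actions * math.log2(horizon) / delta)
    width = 4 * math.sqrt(4 * d * log_term / m_prev)  # statistical width
    per_phase = 2 * eps * (1 + 2 * math.sqrt(d))      # misspecification cost per phase
    return per_phase + width + (ell - 2) * per_phase


if __name__ == "__main__":
    args = dict(ell=5, d=4, m_prev=64, n_actions=10, horizon=10_000, delta=0.05)
    print(active_gap_bound(eps=0.01, **args))
    print(best_active_gap_bound(eps=0.01, ell=5, d=4))
```

Setting `eps=0.0` leaves only the statistical width, which is the regime used in the proof of Theorem 4.3.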

B.2 Proof of Theorem 5.5

Now we are prepared to prove Theorem 5.5.

Proof of Theorem 5.5.

Let $\ell_{\text{max}}(t)$ be the index of the phase that contains round $t$. It is easy to see that $\ell_{\text{max}}(T)\leq\log_{2}(T)$. In the following we condition on the event $\bigcap_{\ell\leq\ell_{\text{max}}(T)}E^{\mathrm{phase}}_{\ell}(\delta)$, which happens with probability at least $1-\delta$ by Lemma B.1.

Notice that phase $\ell_{\text{max}}(t)$ is not necessarily completed by the end of round $t$, but we can always upper bound $\mathrm{Reg}(t)$ by the regret incurred when the first $\ell_{\text{max}}(t)$ phases are run to completion. That is,

$$\mathrm{Reg}(t)\leq\sum_{\ell=1}^{\ell_{\text{max}}(t)}\sum_{a\in\mathcal{A}_{\ell}}T_{\ell}(a)\cdot\Delta_{a}.$$

Since we have controlled the sub-optimality of all active actions in Corollary B.4, it holds with probability at least $1-\delta$ that

$$\begin{aligned}
\mathrm{Reg}(t) &\leq\sum_{\ell=1}^{\ell_{\text{max}}(t)}\sum_{a\in\mathcal{A}_{\ell}}T_{\ell}(a)\cdot\Delta_{a}\\
&\leq 2m_{1}+C\sum_{\ell=2}^{\ell_{\text{max}}(t)}m_{\ell}\left(\sqrt{\frac{d_{\tilde{\nu}}}{m_{\ell-1}}\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}+\varepsilon\ell\sqrt{d_{\tilde{\nu}}}\right)\\
&\leq 2m_{1}+C\sum_{\ell=2}^{\ell_{\text{max}}(t)}\sqrt{m_{\ell}\cdot d_{\tilde{\nu}}\cdot\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}+C\varepsilon\sqrt{d_{\tilde{\nu}}}\sum_{\ell=2}^{\ell_{\text{max}}(t)}m_{\ell}\ell\\
&\leq 2m_{1}+C\sqrt{m_{\ell_{\text{max}}(t)}\cdot d_{\tilde{\nu}}\cdot\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}+C\varepsilon\sqrt{d_{\tilde{\nu}}}\,m_{\ell_{\text{max}}(t)}\log_{2}(T)\\
&\leq C\left(\sqrt{d_{\tilde{\nu}}t\log\left(\frac{2|\mathcal{A}|\log T}{\delta}\right)}+\varepsilon t\sqrt{d_{\tilde{\nu}}}\log T\right),
\end{aligned}$$

where $C>0$ is an absolute constant that can vary from line to line. This completes the proof. ∎
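The passage from the sums over phases to the terms involving only $m_{\ell_{\text{max}}(t)}$ relies on the phase lengths growing geometrically. The sketch below checks the two sums numerically under the assumption of an exact doubling schedule $m_{\ell}=2^{\ell-1}m_{1}$ (the helper names and the schedule are assumptions made for illustration). Under that schedule, $\sum_{\ell\geq 2}\sqrt{m_{\ell}}$ is at most $(2+\sqrt{2})\sqrt{m_{\ell_{\text{max}}}}$ and $\sum_{\ell\geq 2}\ell\, m_{\ell}$ is at most $2\,\ell_{\text{max}}\, m_{\ell_{\text{max}}}$.

```python
import math


def phase_lengths(m1, n_phases):
    """Assumed doubling schedule: m_ell = 2^(ell-1) * m1 for ell = 1..n_phases."""
    return [m1 * 2 ** (ell - 1) for ell in range(1, n_phases + 1)]


def sqrt_sum_ratio(m1, n_phases):
    """(sum over ell >= 2 of sqrt(m_ell)) divided by sqrt(m_last)."""
    lengths = phase_lengths(m1, n_phases)
    total = sum(math.sqrt(m) for m in lengths[1:])
    return total / math.sqrt(lengths[-1])


def weighted_sum_ratio(m1, n_phases):
    """(sum over ell >= 2 of ell * m_ell) divided by (n_phases * m_last)."""
    lengths = phase_lengths(m1, n_phases)
    total = sum(ell * m for ell, m in enumerate(lengths, start=1) if ell >= 2)
    return total / (n_phases * lengths[-1])


if __name__ == "__main__":
    for n in (2, 5, 10, 20):
        print(n, sqrt_sum_ratio(8, n), weighted_sum_ratio(8, n))
```

Both ratios stay bounded by absolute constants as the number of phases grows, which is what allows them to be absorbed into $C$.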

B.3 Proof of Theorem 4.3

Now we go back to the setting where $\varepsilon=0$. The only modification needed to obtain Theorem 4.3 is an instance-dependent control over the number of phases in which sub-optimal arms have not yet been entirely eliminated.

Proof of Theorem 4.3.

Again suppose $E^{\mathrm{phase}}_{\ell}(\delta)$ happens for all $\ell$. From Corollary B.4 we know that, apart from the first phase, every suboptimal action $a$ can only be played in those phases $\ell\geq 2$ such that $\Delta_{a}\leq 4\sqrt{\frac{4d_{\nu}}{m_{\ell-1}}\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}$. Let

$$\ell_{a}=\max\Big\{\ell\geq 2:\Delta_{a}\leq 4\sqrt{\frac{4d_{\nu}}{m_{\ell-1}}\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}\Big\}$$

be the index of the last phase in which $a$ can be played. It is easy to see that

$$\ell_{a}=2+\left\lfloor\log_{2}\left(\frac{64d_{\nu}}{m_{1}\Delta_{a}^{2}}\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)\right)\right\rfloor.$$

Hence there are at most $\ell_{\text{max}}=2+\left\lfloor\log_{2}\left(\frac{64d_{\nu}}{m_{1}\Delta_{\min}^{2}}\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)\right)\right\rfloor$ phases before all suboptimal actions are eliminated, and $\mathrm{Reg}(T)$ can be controlled more carefully:

$$\begin{aligned}
\mathrm{Reg}(T) &\leq\sum_{\ell=1}^{\ell_{\text{max}}}\sum_{a\in\mathcal{A}_{\ell}}T_{\ell}(a)\cdot\Delta_{a}\\
&\leq 2m_{1}+C\sqrt{m_{\ell_{\text{max}}}\cdot d_{\nu}\cdot\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}\\
&=2m_{1}+C\sqrt{2^{\ell_{\text{max}}}\cdot m_{1}\cdot d_{\nu}\cdot\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}\\
&\leq 2m_{1}+C\sqrt{\frac{d_{\nu}\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}{m_{1}\Delta_{\min}^{2}}\cdot m_{1}\cdot d_{\nu}\log\left(\frac{2|\mathcal{A}|\log_{2}(T)}{\delta}\right)}\\
&\leq C\cdot\frac{d_{\nu}\log\left(|\mathcal{A}|\log T/\delta\right)}{\Delta_{\min}},
\end{aligned}$$

where $C>0$ is an absolute constant that can vary from line to line. Again, the above regret bound holds with probability at least $1-\delta$, so we are done. ∎
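The closed form for $\ell_{a}$ can be checked against its definition as a maximum. The following sketch does so under the assumption of a doubling schedule $m_{\ell}=2^{\ell-1}m_{1}$, which is the schedule under which the closed form follows by solving the inequality for $\ell$. The function names and the numerical values are hypothetical.

```python
import math


def log_term(n_actions, horizon, delta):
    """The logarithmic factor log(2 |A| log2(T) / delta)."""
    return math.log(2 * n_actions * math.log2(horizon) / delta)


def last_phase_closed_form(gap, d, m1, n_actions, horizon, delta):
    """Closed form: 2 + floor(log2(64 d log(...) / (m1 gap^2)))."""
    value = 64 * d * log_term(n_actions, horizon, delta) / (m1 * gap ** 2)
    return 2 + math.floor(math.log2(value))


def last_phase_brute_force(gap, d, m1, n_actions, horizon, delta, max_phase=60):
    """Largest ell >= 2 with gap <= 4 sqrt(4 d log(...) / m_{ell-1})."""
    last = None
    for ell in range(2, max_phase + 1):
        m_prev = m1 * 2 ** (ell - 2)  # assumed doubling: m_{ell-1} = 2^(ell-2) m1
        width = 4 * math.sqrt(4 * d * log_term(n_actions, horizon, delta) / m_prev)
        if gap <= width:
            last = ell
    return last


if __name__ == "__main__":
    for gap in (0.5, 0.2, 0.05):
        args = (gap, 3, 4, 10, 10_000, 0.05)
        print(gap, last_phase_closed_form(*args), last_phase_brute_force(*args))
```

Smaller gaps keep an action alive for more phases, and each halving of the gap adds two phases, which is the source of the $1/\Delta_{\min}$ dependence in the final bound.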

Appendix C Anytime Regret Bounds for UCB and C-UCB

In this section we verify Proposition 3.4 for the $\operatorname{UCB}$ and $\operatorname{C-UCB}$ algorithms for completeness. Note that our anytime regret bound for $\operatorname{C-UCB}$ is new in the literature.

C.1 Preliminaries

For each $t\in[T]$, $a\in\mathcal{A}$ and $z\in\mathcal{Z}$, define $\mathbb{T}^{\mathcal{A}}_{t}(a)=1\vee\sum_{s=1}^{t}\mathbb{I}\{A_{s}=a\}$ to be the number of times action $a$ is chosen in the first $t$ rounds (floored at 1), and define $\mathbb{T}^{\mathcal{Z}}_{t}(z)=1\vee\sum_{s=1}^{t}\mathbb{I}\{Z_{s}(A_{s})=z\}$ to be the number of times context $z$ is observed in the first $t$ rounds (floored at 1). Further define the mean reward estimates $\hat{\mu}^{\mathcal{A}}_{t}(a)$ and $\hat{\mu}^{\mathcal{Z}}_{t}(z)$ by

$$\begin{aligned}
\hat{\mu}^{\mathcal{A}}_{t}(a) &=\frac{1}{\mathbb{T}^{\mathcal{A}}_{t}(a)}\sum_{s=1}^{t}Y_{s}(A_{s})\mathbb{I}\{A_{s}=a\},\\
\hat{\mu}^{\mathcal{Z}}_{t}(z) &=\frac{1}{\mathbb{T}^{\mathcal{Z}}_{t}(z)}\sum_{s=1}^{t}Y_{s}(A_{s})\mathbb{I}\{Z_{s}(A_{s})=z\}.
\end{aligned}$$

Next we introduce the upper confidence bounds used by the UCB-type algorithms under consideration. Given any prescribed confidence parameter $\delta\in(0,1)$, define

$$\mathrm{UCB}^{\mathcal{A}}_{t}(a)=\hat{\mu}^{\mathcal{A}}_{t}(a)+\sqrt{\frac{\log(2|\mathcal{A}|T/\delta)}{2\mathbb{T}^{\mathcal{A}}_{t}(a)}},\qquad\mathrm{UCB}^{\mathcal{Z}}_{t}(z)=\hat{\mu}^{\mathcal{Z}}_{t}(z)+\sqrt{\frac{\log(2|\mathcal{Z}|T/\delta)}{2\mathbb{T}^{\mathcal{Z}}_{t}(z)}}$$

and $\widetilde{\mathrm{UCB}}_{t}(a)=\sum_{z\in\mathcal{Z}}\mathrm{UCB}^{\mathcal{Z}}_{t}(z)\,\mathbb{P}_{\nu_{a}}[Z=z]$ for each $t\in[T]$, $a\in\mathcal{A}$ and $z\in\mathcal{Z}$. Furthermore, we use $\operatorname{UCB}(\delta)$ and $\operatorname{C-UCB}(\delta)$ to denote the standard UCB algorithm and the C-UCB algorithm (Lu et al., 2020), which play actions $A^{\mathrm{UCB}}_{t}$ and $A^{\mathrm{C-UCB}}_{t}$ at each round $t$, respectively, according to:

$$\begin{aligned}
A^{\mathrm{UCB}}_{t} &=\operatorname*{arg\,max}_{a\in\mathcal{A}}\mathrm{UCB}^{\mathcal{A}}_{t-1}(a),\\
A^{\mathrm{C-UCB}}_{t} &=\operatorname*{arg\,max}_{a\in\mathcal{A}}\widetilde{\mathrm{UCB}}_{t-1}(a).
\end{aligned}$$
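To make the two selection rules concrete, here is a minimal sketch of how the indices could be computed from a history of actions, contexts and rewards. It assumes finite action and context sets encoded as integers starting at 0, and it takes the marginals $\mathbb{P}_{\nu_{a}}[Z=z]$ as a known table, as in the C-UCB setting. The function names and the data layout are illustrative assumptions, not the authors' implementation.

```python
import math


def clipped_means(labels, rewards, n_labels):
    """Per-label counts floored at 1 and the matching empirical mean rewards."""
    counts = [0] * n_labels
    sums = [0.0] * n_labels
    for label, y in zip(labels, rewards):
        counts[label] += 1
        sums[label] += y
    counts = [max(1, n) for n in counts]  # the "1 v sum" floor in the definition
    return counts, [s / n for s, n in zip(sums, counts)]


def ucb_indices(labels, rewards, n_labels, horizon, delta):
    """Empirical mean plus sqrt(log(2 n_labels T / delta) / (2 count)) per label.

    With labels = past actions this gives UCB^A; with labels = past contexts, UCB^Z.
    """
    counts, means = clipped_means(labels, rewards, n_labels)
    log_term = math.log(2 * n_labels * horizon / delta)
    return [mu + math.sqrt(log_term / (2 * n)) for mu, n in zip(means, counts)]


def cucb_indices(contexts, rewards, marginals, horizon, delta):
    """C-UCB index: context UCBs averaged under each action's known marginal.

    marginals[a][z] is assumed to hold P_{nu_a}[Z = z].
    """
    context_ucb = ucb_indices(contexts, rewards, len(marginals[0]), horizon, delta)
    return [sum(p * u for p, u in zip(row, context_ucb)) for row in marginals]


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


if __name__ == "__main__":
    actions, contexts, rewards = [0, 0, 1], [0, 1, 1], [1.0, 0.0, 1.0]
    marginals = [[1.0, 0.0], [0.5, 0.5]]
    print(argmax(ucb_indices(actions, rewards, 2, horizon=100, delta=0.1)))
    print(argmax(cucb_indices(contexts, rewards, marginals, horizon=100, delta=0.1)))
```

The same helper serves both algorithms because the two confidence bounds have the same form; only the grouping variable (action or context) and the size of the union bound change.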

Before analyzing the regret of $\operatorname{UCB}(\delta)$ and $\operatorname{C-UCB}(\delta)$, we define some high-probability events on which we can control the regret. For any given confidence parameters $\delta,\delta^{\prime}$, define

$$E^{\mathcal{A}}(\delta)=\bigg\{\forall t\in[T],a\in\mathcal{A},\ |\hat{\mu}^{\mathcal{A}}_{t}(a)-\mu^{\mathcal{A}}(a)|\leq\sqrt{\frac{\log(2|\mathcal{A}|T/\delta)}{2\mathbb{T}^{\mathcal{A}}_{t}(a)}}\bigg\},$$

and in conditionally benign environments we additionally define

$$\begin{aligned}
E^{\mathcal{Z}}(\delta) &=\bigg\{\forall t\in[T],z\in\mathcal{Z},\ |\hat{\mu}^{\mathcal{Z}}_{t}(z)-\mu^{\mathcal{Z}}(z)|\leq\sqrt{\frac{\log(2|\mathcal{Z}|T/\delta)}{2\mathbb{T}^{\mathcal{Z}}_{t}(z)}}\bigg\},\\
E^{\mathrm{MG}}(\delta^{\prime}) &=\bigg\{\forall t\in[T],\ \sum_{s=1}^{t}\sum_{z\in\mathcal{Z}}\frac{1}{\sqrt{\mathbb{T}^{\mathcal{Z}}_{s-1}(z)}}\big(\mathbb{P}_{\nu_{A_{s}}}[Z=z]-\mathbb{I}\{Z_{s}=z\}\big)\leq\sqrt{2t\log(T/\delta^{\prime})}\bigg\},
\end{aligned}$$

where we recall that $\mu^{\mathcal{Z}}(z)=\mathbb{E}_{\nu_{a}}[Y\mid Z=z]$ is well-defined here. First, $E^{\mathcal{A}}(\delta)$ and $E^{\mathcal{Z}}(\delta)$ each hold with probability at least $1-\delta$ regardless of the chosen policy, the former in any environment and the latter in any conditionally benign environment:

Lemma C.1 (Lemma B.1 and B.2 in Bilodeau et al. 2022).

For any $\nu\in\mathcal{P}(\mathcal{Z}\times\mathcal{Y})^{\mathcal{A}}$ and $\pi\in\Pi(\mathcal{A},\mathcal{Z},T)$,

\[
\mathbb{P}_{\nu,\pi}[(E^{\mathcal{A}}(\delta))^{c}]\leq\delta,
\]

and for any $\nu\in\mathcal{P}(\mathcal{Z}\times\mathcal{Y})^{\mathcal{A}}$ that is conditionally benign and $\pi\in\Pi(\mathcal{A},\mathcal{Z},T)$,

\[
\mathbb{P}_{\nu,\pi}[(E^{\mathcal{Z}}(\delta))^{c}]\leq\delta.
\]

To get our new anytime regret bound for $\operatorname{C-UCB}(\delta)$, we need to further condition on $E^{\mathrm{MG}}(\delta')$, which happens with probability at least $1-\delta'$:

Lemma C.2.

For any $\nu\in\mathcal{P}(\mathcal{Z}\times\mathcal{Y})^{\mathcal{A}}$ and $\pi\in\Pi(\mathcal{A},\mathcal{Z},T)$,

\[
\mathbb{P}_{\nu,\pi}[(E^{\mathrm{MG}}(\delta'))^{c}]\leq\delta'.
\]
Proof of Lemma C.2.

Define

\begin{align*}
M_{t} &= \sum_{s=1}^{t}\sum_{z\in\mathcal{Z}}\frac{1}{\sqrt{\mathbb{T}^{\mathcal{Z}}_{s-1}(z)}}\big(\mathbb{P}_{\nu_{A_{s}}}[Z=z]-\mathbb{I}\{Z_{s}=z\}\big),\quad\forall t\in[T],\\
M_{0} &= 0.
\end{align*}

Then $E^{\mathrm{MG}}(\delta')=\big\{\forall t\in[T],\ M_{t}\leq\sqrt{2t\log(T/\delta')}\big\}$, and it is easy to check that $\{M_{t}\}_{t\geq 0}$ is a martingale sequence with respect to $\mathcal{F}_{t}=\sigma(A_{t},H_{t-1})$. To see this,

\[
\mathbb{E}_{\nu,\pi}[M_{t}\mid A_{t},H_{t-1}]=M_{t-1}+\sum_{z\in\mathcal{Z}}\frac{1}{\sqrt{\mathbb{T}^{\mathcal{Z}}_{t-1}(z)}}\mathbb{E}_{\nu,\pi}\big[\mathbb{P}_{\nu_{A_{t}}}[Z=z]-\mathbb{I}\{Z_{t}=z\}\mid A_{t}\big]=M_{t-1}.
\]
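
The last equality can be spelled out as follows. This is a short sketch that assumes, as the notation suggests, that the count $\mathbb{T}^{\mathcal{Z}}_{t-1}(z)$ is determined by the history $H_{t-1}$ and that, given $A_{t}$ and $H_{t-1}$, the context $Z_{t}$ is drawn from the $Z$-marginal of $\nu_{A_{t}}$:

\begin{align*}
\mathbb{E}_{\nu,\pi}\big[\mathbb{I}\{Z_{t}=z\}\mid A_{t},H_{t-1}\big] &= \mathbb{P}_{\nu_{A_{t}}}[Z=z],\\
\mathbb{E}_{\nu,\pi}\big[\mathbb{P}_{\nu_{A_{t}}}[Z=z]-\mathbb{I}\{Z_{t}=z\}\mid A_{t},H_{t-1}\big] &= 0\quad\text{for every } z\in\mathcal{Z}.
\end{align*}

Each summand of the increment $M_{t}-M_{t-1}$ therefore has zero conditional mean, while its weight $1/\sqrt{\mathbb{T}^{\mathcal{Z}}_{t-1}(z)}$ is fixed by the conditioning.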

Also,

\begin{align*}
|M_{t}-M_{t-1}| &= \left|\sum_{z\in\mathcal{Z}}\frac{1}{\sqrt{\mathbb{T}^{\mathcal{Z}}_{t-1}(z)}}\left(\mathbb{P}_{\nu_{A_{t}}}[Z=z]-\mathbb{I}\{Z_{t}=z\}\right)\right|\\
&= \left|\mathbb{E}_{\nu,\pi}\Big[\sum_{z\in\mathcal{Z}}\frac{1}{\sqrt{\mathbb{T}^{\mathcal{Z}}_{t-1}(z)}}\mathbb{I}\{Z_{t}=z\}\,\Big|\,A_{t},H_{t-1}\Big]-\sum_{z\in\mathcal{Z}}\frac{1}{\sqrt{\mathbb{T}^{\mathcal{Z}}_{t-1}(z)}}\mathbb{I}\{Z_{t}=z\}\right|\\
&= \left|\mathbb{E}_{\nu,\pi}\Big[\frac{1}{\sqrt{\mathbb{T}^{\mathcal{Z}}_{t-1}(Z_{t})}}\,\Big|\,A_{t},H_{t-1}\Big]-\frac{1}{\sqrt{\mathbb{T}^{\mathcal{Z}}_{t-1}(Z_{t})}}\right|\\
&\leq 1.
\end{align*}

Then by Azuma-Hoeffding,

\begin{align*}
\mathbb{P}_{\nu,\pi}\big[M_{t}>\sqrt{2t\log(T/\delta')}\big] &= \mathbb{P}_{\nu,\pi}\big[M_{t}-M_{0}>\sqrt{2t\log(T/\delta')}\big]\\
&\leq \exp\left(-\frac{2t\log(T/\delta')}{2t}\right)=\delta'/T,
\end{align*}

and we get $\mathbb{P}_{\nu,\pi}[(E^{\mathrm{MG}}(\delta'))^{c}]\leq\delta'$ after taking a union bound over $t\in[T]$. ∎
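
For reference, the form of the Azuma-Hoeffding inequality that matches the computation in this proof is the standard one-sided bound for martingales with bounded increments. The symbols $c_{s}$ and $\varepsilon$ are introduced here only for illustration. If $|M_{s}-M_{s-1}|\leq c_{s}$ almost surely for all $s\in[t]$, then for every $\varepsilon>0$,

\[
\mathbb{P}[M_{t}-M_{0}\geq\varepsilon]\leq\exp\left(-\frac{\varepsilon^{2}}{2\sum_{s=1}^{t}c_{s}^{2}}\right).
\]

With $c_{s}=1$ and $\varepsilon=\sqrt{2t\log(T/\delta')}$, the exponent evaluates to

\[
\frac{\varepsilon^{2}}{2\sum_{s=1}^{t}c_{s}^{2}}=\frac{2t\log(T/\delta')}{2t}=\log(T/\delta'),
\]

so each fixed $t$ fails with probability at most $\exp(-\log(T/\delta'))=\delta'/T$, and summing over the $T$ values of $t$ gives $\delta'$.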

C.2 Anytime High-probability Regret Bound

Now we provide our high-probability regret bounds for $\operatorname{UCB}(\delta)$ and $\operatorname{C-UCB}(\delta)$ that will lead to Proposition 3.4.

Theorem C.3.

In any environment $\nu$, the regret of $\operatorname{UCB}(\delta)$ is bounded by

\[
\mathrm{Reg}(t)=O\left(\sqrt{|\mathcal{A}|\log(|\mathcal{A}|T/\delta)t}\right)
\]

for all $t\in[T]$, conditioning on event $E^{\mathcal{A}}(\delta)$, which happens with probability at least $1-\delta$.

Proof of Theorem C.3.

In event $E^{\mathcal{A}}(\delta)$, we have that $\mu^{\mathcal{A}}(a)\leq\mathrm{UCB}^{\mathcal{A}}_{t}(a)\leq\mu^{\mathcal{A}}(a)+2\sqrt{\frac{\log(2|\mathcal{A}|T/\delta)}{2\mathbb{T}^{\mathcal{A}}_{t}(a)}}$ for all $a\in\mathcal{A}$, $t\in[T]$. Hence, conditioned on $E^{\mathcal{A}}(\delta)$, the regret of $\operatorname{UCB}(\delta)$ up to any round $t\in[T]$ satisfies

\begin{align*}
\mathrm{Reg}(t) &= \sum_{s=1}^{t}\big(\mu^{\mathcal{A}}(a^{*})-\mu^{\mathcal{A}}(A_{s})\big)\\
&= \sum_{s=1}^{t}\Big[\big(\mu^{\mathcal{A}}(a^{*})-\mathrm{UCB}^{\mathcal{A}}_{s-1}(A_{s})\big)+\big(\mathrm{UCB}^{\mathcal{A}}_{s-1}(A_{s})-\mu^{\mathcal{A}}(A_{s})\big)\Big]\\
&\leq \sum_{s=1}^{t}\Big[\big(\mathrm{UCB}^{\mathcal{A}}_{s-1}(a^{*})-\mathrm{UCB}^{\mathcal{A}}_{s-1}(A_{s})\big)+\big(\mathrm{UCB}^{\mathcal{A}}_{s-1}(A_{s})-\mu^{\mathcal{A}}(A_{s})\big)\Big]\\
&\leq \sum_{s=1}^{t}\big(\mathrm{UCB}^{\mathcal{A}}_{s-1}(A_{s})-\mu^{\mathcal{A}}(A_{s})\big)\\
&\leq \sum_{s=1}^{t}\sqrt{\frac{2\log(2|\mathcal{A}|T/\delta)}{\mathbb{T}^{\mathcal{A}}_{s-1}(A_{s})}}\\
&= \sum_{s=1}^{t}\sum_{a\in\mathcal{A}}\sqrt{\frac{2\log(2|\mathcal{A}|T/\delta)}{\mathbb{T}^{\mathcal{A}}_{s-1}(A_{s})}}\,\mathbb{I}\{A_{s}=a\}\\
&\leq \sum_{a\in\mathcal{A}}\sqrt{8\log(2|\mathcal{A}|T/\delta)\,\mathbb{T}^{\mathcal{A}}_{t-1}(a)}\\
&\leq \sqrt{8\log(2|\mathcal{A}|T/\delta)|\mathcal{A}|t},
\end{align*}

where we use $A_{s}=A^{\mathrm{UCB}}_{s}$ throughout to simplify our notation. ∎
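
The last two inequalities of this proof rest on two elementary facts, sketched here under a stated assumption. Write $N_{a}$ (a symbol introduced only for this illustration) for the number of rounds $s\leq t$ with $A_{s}=a$, and assume that along these rounds the counts $\mathbb{T}^{\mathcal{A}}_{s-1}(a)$ take the values $1,2,\dots,N_{a}$, which is the convention the constant $8$ suggests. Then, for every arm with $N_{a}\geq 1$,

\begin{align*}
\sum_{s\leq t:\,A_{s}=a}\frac{1}{\sqrt{\mathbb{T}^{\mathcal{A}}_{s-1}(a)}} &= \sum_{n=1}^{N_{a}}\frac{1}{\sqrt{n}}\leq 1+\int_{1}^{N_{a}}\frac{dx}{\sqrt{x}}=2\sqrt{N_{a}}-1\leq 2\sqrt{N_{a}},\\
\sum_{a\in\mathcal{A}}\sqrt{N_{a}} &\leq \sqrt{|\mathcal{A}|\sum_{a\in\mathcal{A}}N_{a}}\leq\sqrt{|\mathcal{A}|t}.
\end{align*}

Multiplying the first line by $\sqrt{2\log(2|\mathcal{A}|T/\delta)}$ gives the per-arm term $\sqrt{8\log(2|\mathcal{A}|T/\delta)N_{a}}$, and the second line (Cauchy-Schwarz) then yields the final $\sqrt{8\log(2|\mathcal{A}|T/\delta)|\mathcal{A}|t}$.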

Theorem C.4.

In any conditionally benign environment $\nu$, the regret of $\operatorname{C-UCB}(\delta)$ is bounded by

\[
\mathrm{Reg}(t)=O\left(\sqrt{\log(|\mathcal{Z}|T/\delta)}\left(\sqrt{|\mathcal{Z}|}+\sqrt{\log(T/\delta')}\right)\sqrt{t}\right)
\]

for all $t\in[T]$, conditioning on event $E^{\mathcal{Z}}(\delta)\cap E^{\mathrm{MG}}(\delta')$, which happens with probability at least $1-\delta-\delta'$.

Proof of Theorem C.4.

Similarly, in event $E^{\mathcal{Z}}(\delta)$ we have $\mu^{\mathcal{Z}}(z)\leq\mathrm{UCB}^{\mathcal{Z}}_{t}(z)\leq\mu^{\mathcal{Z}}(z)+2\sqrt{\frac{\log(2|\mathcal{Z}|T/\delta)}{2\mathbb{T}^{\mathcal{Z}}_{t}(z)}}$ for all $z\in\mathcal{Z}$, $t\in[T]$. Additionally,

\begin{align*}
\mu^{\mathcal{A}}(a^{*}) &= \sum_{z\in\mathcal{Z}}\mu^{\mathcal{Z}}(z)\,\mathbb{P}_{\nu_{a^{*}}}[Z=z]\\
&\leq \sum_{z\in\mathcal{Z}}\mathrm{UCB}^{\mathcal{Z}}_{t-1}(z)\,\mathbb{P}_{\nu_{a^{*}}}[Z=z]\\
&= \widetilde{\mathrm{UCB}}_{t-1}(a^{*})\leq\widetilde{\mathrm{UCB}}_{t-1}(A_{t}),\quad\forall t\in[T],
\end{align*}

where $A_{t}=A^{\mathrm{C-UCB}}_{t}$ is the action played by $\operatorname{C-UCB}(\delta)$. Therefore we can control the cumulative regret of $\operatorname{C-UCB}(\delta)$ in the first $t$ rounds as follows:

Reg(t)Reg𝑡\displaystyle\mathrm{Reg}(t)roman_Reg ( italic_t ) =s=1t(μ𝒜(a)UCB~s1(As))+(UCB~s1(As)μ𝒜(As))absentsuperscriptsubscript𝑠1𝑡superscript𝜇𝒜superscript𝑎subscript~UCB𝑠1subscript𝐴𝑠subscript~UCB𝑠1subscript𝐴𝑠superscript𝜇𝒜subscript𝐴𝑠\displaystyle=\sum_{s=1}^{t}(\mu^{\mathcal{A}}(a^{*})-\widetilde{\mathrm{UCB}}% _{s-1}(A_{s}))+(\widetilde{\mathrm{UCB}}_{s-1}(A_{s})-\mu^{\mathcal{A}}(A_{s}))= ∑ start_POSTSUBSCRIPT italic_s = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( italic_μ start_POSTSUPERSCRIPT caligraphic_A end_POSTSUPERSCRIPT ( italic_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) - over~ start_ARG roman_UCB end_ARG start_POSTSUBSCRIPT italic_s - 1 end_POSTSUBSCRIPT ( italic_A start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) ) + ( over~ start_ARG roman_UCB end_ARG start_POSTSUBSCRIPT italic_s - 1 end_POSTSUBSCRIPT ( italic_A start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) - italic_μ start_POSTSUPERSCRIPT caligraphic_A end_POSTSUPERSCRIPT ( italic_A start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) )
s=1tUCB~s1(As)μ𝒜(As)absentsuperscriptsubscript𝑠1𝑡subscript~UCB𝑠1subscript𝐴𝑠superscript𝜇𝒜subscript𝐴𝑠\displaystyle\leq\sum_{s=1}^{t}\widetilde{\mathrm{UCB}}_{s-1}(A_{s})-\mu^{% \mathcal{A}}(A_{s})≤ ∑ start_POSTSUBSCRIPT italic_s = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT over~ start_ARG roman_UCB end_ARG start_POSTSUBSCRIPT italic_s - 1 end_POSTSUBSCRIPT ( italic_A start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) - italic_μ start_POSTSUPERSCRIPT caligraphic_A end_POSTSUPERSCRIPT ( italic_A start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT )
=s=1tz𝒵(UCBs1𝒵(z)μ𝒵(z))νAs[Z=z]absentsuperscriptsubscript𝑠1𝑡subscript𝑧𝒵subscriptsuperscriptUCB𝒵𝑠1𝑧superscript𝜇𝒵𝑧subscriptsubscript𝜈subscript𝐴𝑠delimited-[]𝑍𝑧\displaystyle=\sum_{s=1}^{t}\sum_{z\in\mathcal{Z}}(\mathrm{UCB}^{\mathcal{Z}}_% {s-1}(z)-\mu^{\mathcal{Z}}(z))\mathbb{P}_{\nu_{A_{s}}}[Z=z]= ∑ start_POSTSUBSCRIPT italic_s = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_z ∈ caligraphic_Z end_POSTSUBSCRIPT ( roman_UCB start_POSTSUPERSCRIPT caligraphic_Z end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s - 1 end_POSTSUBSCRIPT ( italic_z ) - italic_μ start_POSTSUPERSCRIPT caligraphic_Z end_POSTSUPERSCRIPT ( italic_z ) ) blackboard_P start_POSTSUBSCRIPT italic_ν start_POSTSUBSCRIPT italic_A start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_Z = italic_z ]
s=1tz𝒵2log(2|𝒵|T/δ)𝕋s1𝒵(z)νAs[Z=z]absentsuperscriptsubscript𝑠1𝑡subscript𝑧𝒵22𝒵𝑇𝛿subscriptsuperscript𝕋𝒵𝑠1𝑧subscriptsubscript𝜈subscript𝐴𝑠delimited-[]𝑍𝑧\displaystyle\leq\sum_{s=1}^{t}\sum_{z\in\mathcal{Z}}\sqrt{\frac{2\log(2|% \mathcal{Z}|T/\delta)}{\mathbb{T}^{\mathcal{Z}}_{s-1}(z)}}\mathbb{P}_{\nu_{A_{% s}}}[Z=z]≤ ∑ start_POSTSUBSCRIPT italic_s = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_z ∈ caligraphic_Z end_POSTSUBSCRIPT square-root start_ARG divide start_ARG 2 roman_log ( 2 | caligraphic_Z | italic_T / italic_δ ) end_ARG start_ARG blackboard_T start_POSTSUPERSCRIPT caligraphic_Z end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s - 1 end_POSTSUBSCRIPT ( italic_z ) end_ARG end_ARG blackboard_P start_POSTSUBSCRIPT italic_ν start_POSTSUBSCRIPT italic_A start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_Z = italic_z ]
=s=1tz𝒵2log(2|𝒵|T/δ)𝕋s1𝒵(z)𝕀{Zs=z}+s=1tz𝒵2log(2|𝒵|T/δ)𝕋s1𝒵(z)(νAs[Z=z]𝕀{Zs=z})absentsuperscriptsubscript𝑠1𝑡subscript𝑧𝒵22𝒵𝑇𝛿subscriptsuperscript𝕋𝒵𝑠1𝑧𝕀subscript𝑍𝑠𝑧superscriptsubscript𝑠1𝑡subscript𝑧𝒵22𝒵𝑇𝛿subscriptsuperscript𝕋𝒵𝑠1𝑧subscriptsubscript𝜈subscript𝐴𝑠delimited-[]𝑍𝑧𝕀subscript𝑍𝑠𝑧\displaystyle=\sum_{s=1}^{t}\sum_{z\in\mathcal{Z}}\sqrt{\frac{2\log(2|\mathcal% {Z}|T/\delta)}{\mathbb{T}^{\mathcal{Z}}_{s-1}(z)}}\mathbb{I}\{Z_{s}=z\}+\sum_{% s=1}^{t}\sum_{z\in\mathcal{Z}}\sqrt{\frac{2\log(2|\mathcal{Z}|T/\delta)}{% \mathbb{T}^{\mathcal{Z}}_{s-1}(z)}}(\mathbb{P}_{\nu_{A_{s}}}[Z=z]-\mathbb{I}\{% Z_{s}=z\})= ∑ start_POSTSUBSCRIPT italic_s = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_z ∈ caligraphic_Z end_POSTSUBSCRIPT square-root start_ARG divide start_ARG 2 roman_log ( 2 | caligraphic_Z | italic_T / italic_δ ) end_ARG start_ARG blackboard_T start_POSTSUPERSCRIPT caligraphic_Z end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s - 1 end_POSTSUBSCRIPT ( italic_z ) end_ARG end_ARG blackboard_I { italic_Z start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT = italic_z } + ∑ start_POSTSUBSCRIPT italic_s = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_z ∈ caligraphic_Z end_POSTSUBSCRIPT square-root start_ARG divide start_ARG 2 roman_log ( 2 | caligraphic_Z | italic_T / italic_δ ) end_ARG start_ARG blackboard_T start_POSTSUPERSCRIPT caligraphic_Z end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s - 1 end_POSTSUBSCRIPT ( italic_z ) end_ARG end_ARG ( blackboard_P start_POSTSUBSCRIPT italic_ν start_POSTSUBSCRIPT italic_A start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_Z = italic_z ] - blackboard_I { italic_Z start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT = italic_z } )
8log(2|𝒵|T/δ)|𝒵|t+s=1tz𝒵2log(2|𝒵|T/δ)𝕋s1𝒵(z)(νAs[Z=z]𝕀{Zs=z}),absent82𝒵𝑇𝛿𝒵𝑡superscriptsubscript𝑠1𝑡subscript𝑧𝒵22𝒵𝑇𝛿subscriptsuperscript𝕋𝒵𝑠1𝑧subscriptsubscript𝜈subscript𝐴𝑠delimited-[]𝑍𝑧𝕀subscript𝑍𝑠𝑧\displaystyle\leq\sqrt{8\log(2|\mathcal{Z}|T/\delta)|\mathcal{Z}|t}+\sum_{s=1}% ^{t}\sum_{z\in\mathcal{Z}}\sqrt{\frac{2\log(2|\mathcal{Z}|T/\delta)}{\mathbb{T% }^{\mathcal{Z}}_{s-1}(z)}}(\mathbb{P}_{\nu_{A_{s}}}[Z=z]-\mathbb{I}\{Z_{s}=z\}),≤ square-root start_ARG 8 roman_log ( 2 | caligraphic_Z | italic_T / italic_δ ) | caligraphic_Z | italic_t end_ARG + ∑ start_POSTSUBSCRIPT italic_s = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_z ∈ caligraphic_Z end_POSTSUBSCRIPT square-root start_ARG divide start_ARG 2 roman_log ( 2 | caligraphic_Z | italic_T / italic_δ ) end_ARG start_ARG blackboard_T start_POSTSUPERSCRIPT caligraphic_Z end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s - 1 end_POSTSUBSCRIPT ( italic_z ) end_ARG end_ARG ( blackboard_P start_POSTSUBSCRIPT italic_ν start_POSTSUBSCRIPT italic_A start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_Z = italic_z ] - blackboard_I { italic_Z start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT = italic_z } ) ,

where in the last inequality we use the same argument as in the proof of Theorem C.3, and the remaining summation term is bounded by $\sqrt{4\log(2|\mathcal{Z}|T/\delta)\log(T/\delta')t}$ once we further condition on $E^{\mathrm{MG}}(\delta')$. Therefore, we get

$$\mathrm{Reg}(t)\leq\sqrt{\log(2|\mathcal{Z}|T/\delta)}\left(\sqrt{8|\mathcal{Z}|}+\sqrt{4\log(T/\delta')}\right)\sqrt{t},\quad\forall t\in[T],$$

on the event $E^{\mathcal{Z}}(\delta)\cap E^{\mathrm{MG}}(\delta')$. ∎

Combining Theorem C.3 with Theorem C.4 and taking $\delta'=\delta$, we thus verify Proposition 3.4.

Appendix D Proofs of Lower Bounds

In this section we give the full proofs of Theorem 3.7 and Theorem 5.2. Our proof of Theorem 3.7 follows, and substantially generalizes, that of Bilodeau et al. (2022, Theorem 6.2).

D.1 Proof of Theorem 3.7

Proof of Theorem 3.7.

Fix $\mathcal{A},\mathcal{Z}$ and $T$. Let $\mathcal{Z}_{0}$ be an arbitrary proper subset of $\mathcal{Z}$ and $\mathcal{Z}_{1}=\mathcal{Z}\setminus\mathcal{Z}_{0}$. Fix $\Delta\in(0,1/20)$ to be chosen later. Define the family of marginals for all instances appearing in this proof

$$q_{a}[Z\in\mathcal{Z}_{0}]=\begin{cases}1/2+2\Delta & a=1\\ 1/2 & a\neq 1,\end{cases}$$

where probability is evenly spread within $\mathcal{Z}_{0}$ and $\mathcal{Z}_{1}$ respectively. Then define a conditionally benign environment $\nu\in\mathcal{P}(\mathcal{Z}\times\mathcal{Y})^{\mathcal{A}}$ by

$$\mathbb{P}_{\nu_{a}}[Y=1]=\sum_{z\in\mathcal{Z}}p[Y=1|Z=z]\,q_{a}[Z=z],\quad\forall a\in\mathcal{A},$$

where $p[Y|Z]$ is a Bernoulli conditional distribution such that

$$p[Y=1|Z=z]=\begin{cases}3/4 & z\in\mathcal{Z}_{0}\\ 1/4 & z\in\mathcal{Z}_{1}.\end{cases}$$

Now we define some non-benign instances. For every $a_{0}\neq 1$, define $\nu^{a_{0}}$ by

$$\mathbb{P}_{\nu_{a}^{a_{0}}}[Y=1]=\sum_{z\in\mathcal{Z}}p_{a}^{a_{0}}[Y=1|Z=z]\,q_{a}[Z=z],\quad\forall a\in\mathcal{A},$$

where $p_{a}^{a_{0}}[Y|Z]$ is a Bernoulli conditional distribution such that

$$p_{a}^{a_{0}}[Y=1|Z=z]=\begin{cases}3/4 & a=1,\ z\in\mathcal{Z}_{0}\\ 1/4 & a=1,\ z\in\mathcal{Z}_{1}\\ 3/4+4\Delta & a=a_{0},\ z\in\mathcal{Z}_{0}\\ 1/4 & a=a_{0},\ z\in\mathcal{Z}_{1}\\ 3/4 & a\notin\{1,a_{0}\},\ z\in\mathcal{Z}_{0}\\ 1/4 & a\notin\{1,a_{0}\},\ z\in\mathcal{Z}_{1}.\end{cases}$$
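As a sanity check on the construction above, the mean rewards of the two kinds of instances can be computed exactly. The sketch below is illustrative only: the helper name `mean_reward` and the value $\Delta=1/100$ are assumptions made for the example, not part of the proof. It confirms that action 1 leads by $\Delta$ in $\nu$, while action $a_{0}$ leads action 1 by $\Delta$ in $\nu^{a_{0}}$.

```python
from fractions import Fraction

def mean_reward(p_z0, p_z1, q_z0):
    """E[Y] when Z falls in Z_0 with probability q_z0 and Y | Z is Bernoulli.

    p_z0, p_z1 are the success probabilities on Z_0 and Z_1 respectively.
    """
    return p_z0 * q_z0 + p_z1 * (1 - q_z0)

delta = Fraction(1, 100)  # illustrative gap (an assumption for this example)
half, quarter, three_quarters = Fraction(1, 2), Fraction(1, 4), Fraction(3, 4)

# Benign instance nu: shared conditional p, marginals q_a.
nu_arm1 = mean_reward(three_quarters, quarter, half + 2 * delta)
nu_other = mean_reward(three_quarters, quarter, half)

# Non-benign instance nu^{a_0}: arm a_0 keeps marginal 1/2 on Z_0 but gets
# the boosted conditional 3/4 + 4*Delta there.
alt_arm_a0 = mean_reward(three_quarters + 4 * delta, quarter, half)

gap_in_nu = nu_arm1 - nu_other       # lead of arm 1 in nu
gap_in_alt = alt_arm_a0 - nu_arm1    # lead of arm a_0 over arm 1 in nu^{a_0}
print(gap_in_nu, gap_in_alt)
```

Exact rational arithmetic is used so that the two gaps can be compared with $\Delta$ without floating-point tolerance.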

For any MAB algorithm $\mathfrak{a}$, let $\pi^{q}=\mathfrak{a}(\mathcal{A},\mathcal{Z},q,T)$ be the actual policy implemented by $\mathfrak{a}$ when it interacts with $\nu$ and $\nu^{a_{0}}$. Then, by the divergence decomposition formula and the Bretagnolle–Huber inequality,

$$\begin{aligned}
\mathbb{E}_{\nu,\pi^{q}}[\mathrm{Reg}(T)]+\mathbb{E}_{\nu^{a_{0}},\pi^{q}}[\mathrm{Reg}(T)]&\geq\frac{T\Delta}{2}\mathbb{P}_{\nu,\pi^{q}}[\mathbb{T}^{\mathcal{A}}_{T}(1)\leq T/2]+\frac{T\Delta}{2}\mathbb{P}_{\nu^{a_{0}},\pi^{q}}[\mathbb{T}^{\mathcal{A}}_{T}(1)>T/2]\\
&\geq\frac{T\Delta}{4}\exp\left(-\mathrm{KL}(\mathbb{P}_{\nu,\pi^{q}}\ \|\ \mathbb{P}_{\nu^{a_{0}},\pi^{q}})\right)\\
&=\frac{T\Delta}{4}\exp\left(-\frac{1}{2}\mathbb{E}_{\nu,\pi^{q}}[\mathbb{T}^{\mathcal{A}}_{T}(a_{0})]\,\mathrm{KL}(\mathrm{Ber}(3/4)\ \|\ \mathrm{Ber}(3/4+4\Delta))\right)\\
&\geq\frac{T\Delta}{4}\exp\left(-\mathbb{E}_{\nu,\pi^{q}}[\mathbb{T}^{\mathcal{A}}_{T}(a_{0})]\cdot 32\Delta^{2}\right),
\end{aligned}$$

where in the last step we use $\mathrm{KL}(\mathrm{Ber}(3/4)\ \|\ \mathrm{Ber}(3/4+4\Delta))\leq 64\Delta^{2}$ for $\Delta<1/40$. Combined with the worst-case regret upper bound $\mathbb{E}_{\nu,\pi^{q}}[\mathrm{Reg}(T)]+\mathbb{E}_{\nu^{a_{0}},\pi^{q}}[\mathrm{Reg}(T)]\leq 2R(T;\mathcal{A},\mathcal{Z})$, this implies that

$$\mathbb{E}_{\nu,\pi^{q}}[\mathbb{T}^{\mathcal{A}}_{T}(a_{0})]\geq\frac{1}{32\Delta^{2}}\log\left(\frac{T\Delta}{8R(T;\mathcal{A},\mathcal{Z})}\right),\quad\forall a_{0}\neq 1.$$
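The Bernoulli KL bound $\mathrm{KL}(\mathrm{Ber}(3/4)\ \|\ \mathrm{Ber}(3/4+4\Delta))\leq 64\Delta^{2}$ used in this derivation can be checked numerically. The following is a rough sketch, not a proof: the grid of $\Delta$ values and the helper name `kl_bernoulli` are choices made for illustration, and the check only covers the sampled points in $(0,1/40)$.

```python
import math

def kl_bernoulli(p, q):
    """KL(Ber(p) || Ber(q)) in nats, for p and q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Sample gaps strictly inside (0, 1/40); the largest grid point is 999/40000.
grid = [k / 40000 for k in range(1, 1000)]

# Ratio of the exact divergence to the claimed upper bound 64 * Delta^2.
worst_ratio = max(
    kl_bernoulli(0.75, 0.75 + 4 * d) / (64 * d * d) for d in grid
)
print(worst_ratio)
```

A ratio below 1 at every grid point is consistent with the stated inequality on this range.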

Noting that $\mathbb{E}_{\nu,\pi^{q}}[\mathrm{Reg}(T)]=\sum_{a_{0}\neq 1}\Delta\,\mathbb{E}_{\nu,\pi^{q}}[\mathbb{T}^{\mathcal{A}}_{T}(a_{0})]$, we have

$$\mathbb{E}_{\nu,\pi^{q}}[\mathrm{Reg}(T)]\geq\frac{|\mathcal{A}|-1}{32\Delta}\log\left(\frac{T\Delta}{8R(T;\mathcal{A},\mathcal{Z})}\right).$$

So there exist absolute constants $c=\log 2/1024$ and $c'=1/641$ such that, whenever $R(T;\mathcal{A},\mathcal{Z})\leq c'T$, the choice $\Delta=\frac{16R(T;\mathcal{A},\mathcal{Z})}{T}$ satisfies $\Delta<1/40$ and

$$\mathbb{E}_{\nu,\pi^{q}}[\mathrm{Reg}(T)]\geq c\cdot\frac{|\mathcal{A}|T}{R(T;\mathcal{A},\mathcal{Z})},$$

which completes the proof. ∎
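For readers who want to trace the constants in the proof of Theorem 3.7, the arithmetic can be spelled out as follows. This is a worked sketch writing $R$ for $R(T;\mathcal{A},\mathcal{Z})$; the step $|\mathcal{A}|-1\geq|\mathcal{A}|/2$ assumes $|\mathcal{A}|\geq 2$, which is our reading of how the stated $c$ arises rather than something spelled out in the text.

```latex
% Range check for the choice \Delta = 16R/T under R \le c' T with c' = 1/641:
\Delta = \frac{16R}{T} \le \frac{16}{641} < \frac{16}{640} = \frac{1}{40}.

% The logarithm collapses to a constant:
\frac{T\Delta}{8R} = \frac{16R}{8R} = 2 .

% Substituting into the bound on the expected regret:
\mathbb{E}_{\nu,\pi^{q}}[\mathrm{Reg}(T)]
  \ge \frac{|\mathcal{A}|-1}{32\Delta}\,\log 2
  = \frac{(|\mathcal{A}|-1)\,T\,\log 2}{512\,R}
  \ge \frac{\log 2}{1024}\cdot\frac{|\mathcal{A}|\,T}{R}
  \qquad (\text{assuming } |\mathcal{A}| \ge 2).
```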

D.2 Proof of Theorem 5.2

Proof of Theorem 5.2.

Fix $\mathcal{A},\mathcal{Z}$ and $T$. Let $\mathcal{Z}_{0}$ be an arbitrary proper subset of $\mathcal{Z}$ and $\mathcal{Z}_{1}=\mathcal{Z}\setminus\mathcal{Z}_{0}$. Fix $\Delta\in(0,\frac{1}{40})$ to be chosen later. For all conditionally benign instances $\nu$ in this proof, we consider $\mathbb{P}_{\nu_{a}}[Y|Z]$ to be the Bernoulli distribution given by

$$\mathbb{P}_{\nu_{a}}[Y=1|Z=z]=p[Y=1|Z=z]=\begin{cases}3/4 & z\in\mathcal{Z}_{0}\\ 1/4 & z\in\mathcal{Z}_{1},\end{cases}$$

which implies that contexts from $\mathcal{Z}_{0}$ are more rewarding than those from $\mathcal{Z}_{1}$.

Now define conditionally benign environments $\nu,\nu^{a_{0}}\in\mathcal{P}(\mathcal{Z}\times\mathcal{Y})^{\mathcal{A}}$, $\forall a_{0}\neq 1$, through their marginals

$$\begin{aligned}
\mathbb{P}_{\nu_{a}}[Y=1]&=\sum_{z\in\mathcal{Z}}p[Y=1|Z=z]\,q_{a}[Z=z],\\
\mathbb{P}_{\nu^{a_{0}}_{a}}[Y=1]&=\sum_{z\in\mathcal{Z}}p[Y=1|Z=z]\,q^{a_{0}}_{a}[Z=z],\quad\forall a\in\mathcal{A},
\end{aligned}$$

where

$$q_{a}[Z\in\mathcal{Z}_{0}]=\begin{cases}1/2+2\Delta & a=1\\ 1/2 & a\neq 1\end{cases}\quad\text{and}\quad q^{a_{0}}_{a}[Z\in\mathcal{Z}_{0}]=\begin{cases}1/2+2\Delta & a=1\\ 1/2+4\Delta & a=a_{0}\\ 1/2 & a\neq 1,a_{0},\end{cases}$$

where probability is evenly spread within $\mathcal{Z}_{0}$ and $\mathcal{Z}_{1}$. The mean reward of an action whose marginal puts mass $m$ on $\mathcal{Z}_{0}$ is $\frac{3}{4}m+\frac{1}{4}(1-m)=\frac{1}{4}+\frac{m}{2}$, so action 1 is the only optimal action in $\nu$ and action $a_{0}$ is the only optimal action in $\nu^{a_{0}}$, with sub-optimality gap $\Delta_{\min}(\nu)=\Delta_{\min}(\nu^{a_{0}})=\Delta$.

Fix an algorithm $\mathfrak{a}\in\mathbb{A}_{\mathrm{agnostic}}$ and let $\tilde{\pi}=\mathfrak{a}(\mathcal{A},\mathcal{Z},T,\cdot)$ be the actual policy implemented by $\mathfrak{a}$. By the divergence decomposition formula (Bilodeau et al., 2022), we have that for every $a_{0}\neq 1$,

$$\begin{aligned}
\mathrm{KL}(\mathbb{P}_{\nu,\tilde{\pi}}\ \|\ \mathbb{P}_{\nu^{a_{0}},\tilde{\pi}})&=\sum_{a\in\mathcal{A}}\mathbb{E}_{\nu,\tilde{\pi}}[\mathbb{T}^{\mathcal{A}}_{T}(a)]\,\mathrm{KL}(\mathbb{P}_{\nu_{a}}\ \|\ \mathbb{P}_{\nu^{a_{0}}_{a}})\\
&=\sum_{a\in\mathcal{A}}\mathbb{E}_{\nu,\tilde{\pi}}[\mathbb{T}^{\mathcal{A}}_{T}(a)]\,\mathrm{KL}(q_{a}\ \|\ q^{a_{0}}_{a})\\
&=\mathbb{E}_{\nu,\tilde{\pi}}[\mathbb{T}^{\mathcal{A}}_{T}(a_{0})]\,\mathrm{KL}(q_{a_{0}}\ \|\ q^{a_{0}}_{a_{0}})\\
&=\mathbb{E}_{\nu,\tilde{\pi}}[\mathbb{T}^{\mathcal{A}}_{T}(a_{0})]\,\mathrm{KL}(\mathrm{Ber}(1/2)\ \|\ \mathrm{Ber}(1/2+4\Delta)).
\end{aligned}$$
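The last equality relies on the probability being spread evenly within $\mathcal{Z}_{0}$ and within $\mathcal{Z}_{1}$, so that the divergence between the two context marginals collapses to a Bernoulli divergence on the event $\{Z\in\mathcal{Z}_{0}\}$. The sketch below illustrates this numerically; the block sizes $|\mathcal{Z}_{0}|=3$, $|\mathcal{Z}_{1}|=5$, the value $\Delta=0.01$, and all function names are hypothetical choices made for the example.

```python
import math

def kl_discrete(p, q):
    """KL(p || q) in nats for two distributions given as aligned lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_bernoulli(p, q):
    """KL(Ber(p) || Ber(q)) in nats."""
    return kl_discrete([p, 1 - p], [q, 1 - q])

def block_uniform(mass_z0, n0, n1):
    """Distribution over n0 + n1 contexts: mass_z0 spread evenly over the
    first n0 contexts (Z_0), the remaining mass evenly over the last n1 (Z_1)."""
    return [mass_z0 / n0] * n0 + [(1 - mass_z0) / n1] * n1

delta, n0, n1 = 0.01, 3, 5  # illustrative values, not taken from the paper

q_a0 = block_uniform(0.5, n0, n1)               # marginal of arm a_0 in nu
q_alt = block_uniform(0.5 + 4 * delta, n0, n1)  # marginal of arm a_0 in nu^{a_0}

full = kl_discrete(q_a0, q_alt)                 # divergence over all contexts
reduced = kl_bernoulli(0.5, 0.5 + 4 * delta)    # divergence of the Z_0 indicator
print(full, reduced)
```

The two printed values should agree up to floating-point error, whatever block sizes are chosen.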

By the Bretagnolle–Huber inequality,

$$\begin{aligned}
\mathbb{E}_{\nu,\tilde{\pi}}[\mathrm{Reg}(T)]+\mathbb{E}_{\nu^{a_{0}},\tilde{\pi}}[\mathrm{Reg}(T)]&\geq\frac{T\Delta}{2}\left(\mathbb{P}_{\nu,\tilde{\pi}}[\mathbb{T}^{\mathcal{A}}_{T}(1)\leq T/2]+\mathbb{P}_{\nu^{a_{0}},\tilde{\pi}}[\mathbb{T}^{\mathcal{A}}_{T}(1)>T/2]\right)\\
&\geq\frac{T\Delta}{4}\exp\left(-\mathrm{KL}(\mathbb{P}_{\nu,\tilde{\pi}}\ \|\ \mathbb{P}_{\nu^{a_{0}},\tilde{\pi}})\right)\\
&=\frac{T\Delta}{4}\exp\left(-\mathbb{E}_{\nu,\tilde{\pi}}[\mathbb{T}^{\mathcal{A}}_{T}(a_{0})]\,\mathrm{KL}(\mathrm{Ber}(1/2)\ \|\ \mathrm{Ber}(1/2+4\Delta))\right).
\end{aligned}$$

Now we pick $a_{0}\in\operatorname*{arg\,min}_{a\neq 1}\mathbb{E}_{\nu,\tilde{\pi}}[\mathbb{T}^{\mathcal{A}}_{T}(a)]$. Since the expected pull counts of the $|\mathcal{A}|-1$ actions other than action 1 sum to at most $T$, this implies that $\mathbb{E}_{\nu,\tilde{\pi}}[\mathbb{T}^{\mathcal{A}}_{T}(a_{0})]\leq\frac{T}{|\mathcal{A}|-1}$. Also, $\mathrm{KL}(\mathrm{Ber}(1/2)\ \|\ \mathrm{Ber}(1/2+4\Delta))\leq 4(4\Delta)^{2}=64\Delta^{2}$ for $\Delta<1/40$. So

$$\mathbb{E}_{\nu,\tilde{\pi}}[\mathrm{Reg}(T)]+\mathbb{E}_{\nu^{a_{0}},\tilde{\pi}}[\mathrm{Reg}(T)]\geq\frac{T\Delta}{4}\exp\left(-\frac{64T\Delta^{2}}{|\mathcal{A}|-1}\right).$$
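The quadratic bound on the Bernoulli KL divergence used in this step can be sanity-checked numerically. The following is a minimal sketch, not part of the original argument; the helper names `kl_bernoulli` and `kl_bound_holds` and the choice of grid are illustrative assumptions.

```python
import math


def kl_bernoulli(p: float, q: float) -> float:
    """KL(Ber(p) || Ber(q)) in nats, for p and q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_bound_holds(delta: float) -> bool:
    """Check KL(Ber(1/2) || Ber(1/2 + 4*delta)) <= 64 * delta**2."""
    return kl_bernoulli(0.5, 0.5 + 4 * delta) <= 64 * delta ** 2


# Hypothetical grid of gaps strictly inside (0, 1/40), the range used in the proof.
grid = [k / 40000 for k in range(1, 1000)]
all_ok = all(kl_bound_holds(d) for d in grid)
print(all_ok)
```

A grid check of this kind only illustrates the inequality on the sampled values; it does not replace the analytic bound.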

Taking $\Delta=\frac{1}{40}\sqrt{\frac{|\mathcal{A}|-1}{T}}$, we know that $\max\left\{\mathbb{E}_{\nu,\tilde{\pi}}[\mathrm{Reg}(T)],\mathbb{E}_{\nu^{a_{0}},\tilde{\pi}}[\mathrm{Reg}(T)]\right\}\geq\frac{1}{2}\left(\mathbb{E}_{\nu,\tilde{\pi}}[\mathrm{Reg}(T)]+\mathbb{E}_{\nu^{a_{0}},\tilde{\pi}}[\mathrm{Reg}(T)]\right)\geq c\sqrt{|\mathcal{A}|T}$ for some absolute constant $c>0$, which yields the claim. ∎
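For concreteness, the constant in the last step can be traced explicitly. The computation below is a sketch under two assumptions that are not stated at this point in the original: $T>|\mathcal{A}|-1$, so that the chosen $\Delta$ satisfies $\Delta<1/40$, and $|\mathcal{A}|\geq 2$, so that $|\mathcal{A}|-1\geq|\mathcal{A}|/2$. The explicit value of $c$ is our own bookkeeping and is not claimed to be tight.

```latex
\begin{align*}
\frac{T\Delta}{4}
  &= \frac{1}{160}\sqrt{(|\mathcal{A}|-1)\,T},
\qquad
\frac{64\,T\Delta^{2}}{|\mathcal{A}|-1} = \frac{64}{1600} = \frac{1}{25},\\
\frac{1}{2}\cdot\frac{T\Delta}{4}\exp\left(-\frac{64\,T\Delta^{2}}{|\mathcal{A}|-1}\right)
  &= \frac{e^{-1/25}}{320}\sqrt{(|\mathcal{A}|-1)\,T}
  \;\geq\; \frac{e^{-1/25}}{320\sqrt{2}}\sqrt{|\mathcal{A}|\,T}.
\end{align*}
```

Under these assumptions, one admissible choice is $c=e^{-1/25}/(320\sqrt{2})$.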

In the above proof, it is easy to see that $\nu(Z)$ and $\nu^{a_{0}}(Z)$ are $\varepsilon$-close, where $\varepsilon=c\cdot\sqrt{|\mathcal{A}|/T}$ for some absolute constant $c$. So for any algorithm that is given $\tilde{\nu}(Z)=\nu(Z)$ as input and interacts with $\bar{\nu}\in\{\nu\}\cup\{\nu^{a_{0}}:a_{0}\neq 1\}$, the marginals $\tilde{\nu}(Z)$ and $\bar{\nu}(Z)$ are always $\varepsilon$-close, yet the algorithm incurs $\Omega(\sqrt{|\mathcal{A}|T})$ regret in some instance from $\{\nu\}\cup\{\nu^{a_{0}}:a_{0}\neq 1\}$.

Appendix E Instances where $\operatorname{PE}$ incurs linear regret

In this section, we give an example for $\operatorname{PE}$ illustrating that, merely to force linear regret on a causal bandit algorithm, we need to construct non-benign instances carefully and re-code the algorithm to ensure its erratic behavior on those instances. In particular, for every $\Delta\in(0,1)$, we construct a non-benign environment $\nu$ such that $\Delta_{\min}(\nu)=\Delta$ while the re-coded $\operatorname{PE}$ never plays the optimal arm.

Proposition E.1.

Suppose we modify Algorithm 2 such that, in each phase, we always choose an exact G-optimal design whenever feasible. For any $\mathcal{A},\mathcal{Z}$ and $T$ with $|\mathcal{A}|>|\mathcal{Z}|\geq 3$ and any $\Delta\in(0,1)$, there exists a non-benign environment $\nu$ such that $\Delta_{\min}(\nu)=\Delta$, while Algorithm 2 never plays the optimal arm, hence incurring linear regret,

$$\mathrm{Reg}(t)\geq\Delta_{\min}(\nu)\cdot t=\Delta\cdot t,\quad\forall t\in[T].$$
Proof of Proposition E.1.

For any $\mathcal{A}$ and $\mathcal{Z}$ with $|\mathcal{A}|>|\mathcal{Z}|\geq 3$, suppose we index the contexts in an arbitrary way such that $\mathcal{Z}=\{z_{1},\dots,z_{|\mathcal{Z}|}\}$, and we pick $|\mathcal{Z}|+1$ arms from $\mathcal{A}$ and denote them by $a^{*},a_{1},\dots,a_{|\mathcal{Z}|}$. Construct marginals $\nu_{a}$ as follows:

$$\begin{aligned}
\nu_{a_{i}}(Z) &= \delta_{\{z_{i}\}} =: e_{i},\quad i\in[|\mathcal{Z}|],\\
\nu_{a^{*}}(Z) &= \frac{1}{2}(\delta_{\{z_{1}\}}+\delta_{\{z_{2}\}}) = \frac{1}{2}(e_{1}+e_{2}),
\end{aligned}$$

where we write marginal distributions over $\mathcal{Z}$ as vectors in $\mathbb{R}^{|\mathcal{Z}|}$ according to context indices. Then define conditional distributions $\mathbb{P}_{\nu_{a}}(Y|Z)$:

$$\mathbb{P}_{\nu_{a_{i}}}[Y|Z=z_{i}]=\begin{cases}\delta_{\{0\}} & i\in[|\mathcal{Z}|-1]\\ \delta_{\{1-\Delta\}} & i=|\mathcal{Z}|\end{cases}$$

and

$$\mathbb{P}_{\nu_{a^{*}}}[Y|Z=z_{1}]=\mathbb{P}_{\nu_{a^{*}}}[Y|Z=z_{2}]=\delta_{\{1\}}.$$

In other words, playing arm $a_{i}$ yields context $z_{i}$ and a deterministic reward, while playing arm $a^{*}$ yields $z_{1}$ or $z_{2}$ with equal probability and always gives the optimal reward $1$. So the only optimal arm for $\nu$ is $a^{*}$, with $\Delta_{\min}(\nu)=\Delta$. We can treat all other $a\in\mathcal{A}$ as dummy actions by identifying each of them with one of $a_{i}$, $i\in[|\mathcal{Z}|]$, arbitrarily, so that $a^{*}$ stays the unique optimal arm.
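The instance can be written out explicitly to confirm the mean rewards and the minimum gap. The following is a minimal sketch with exact rational arithmetic; the function names, the arm labels `a_1`, ..., `a_star`, and the choice of $|\mathcal{Z}|=4$ and $\Delta=1/10$ are illustrative assumptions. Rewards at contexts an arm never observes are set to 0 because they do not affect the mean.

```python
from fractions import Fraction


def build_instance(n, delta):
    """Map each arm to (context marginal, deterministic reward per context)."""
    arms = {}
    for i in range(n):
        marginal = [Fraction(int(j == i)) for j in range(n)]
        rewards = [Fraction(0)] * n
        if i == n - 1:
            rewards[i] = 1 - delta  # only the last arm a_n earns 1 - delta
        arms[f"a_{i + 1}"] = (marginal, rewards)
    half = Fraction(1, 2)
    star_marginal = [half, half] + [Fraction(0)] * (n - 2)
    star_rewards = [Fraction(1), Fraction(1)] + [Fraction(0)] * (n - 2)
    arms["a_star"] = (star_marginal, star_rewards)
    return arms


def mean_rewards(arms):
    """Expected reward of each arm under its own context marginal."""
    return {
        name: sum(p * r for p, r in zip(marginal, rewards))
        for name, (marginal, rewards) in arms.items()
    }


def min_gap(means):
    """Smallest positive gap to the best mean reward."""
    best = max(means.values())
    return min(best - v for v in means.values() if v < best)


means = mean_rewards(build_instance(4, Fraction(1, 10)))
print(means, min_gap(means))
```

Note that the conditional reward at $z_{1}$ differs between $a_{1}$ (reward 0) and $a^{*}$ (reward 1), which is what makes the environment non-benign.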

Next we verify the following facts. (1) When no action has been eliminated and $\mathcal{A}_{\ell}=\mathcal{A}$, any exact G-optimal design $\pi_{\ell}\in\mathcal{P}(\mathcal{A}_{\ell})$ puts no mass on $a^{*}$. (2) Whenever any action is eliminated at the end of phase $\ell$, all actions except for $a_{|\mathcal{Z}|}$ are eliminated as well. $\operatorname{PE}$ would then just play $a_{|\mathcal{Z}|}$ until the end. Combining these two facts, we conclude that $\operatorname{PE}$ never picks $a^{*}$ during the interaction with $\nu$.

No G-optimal design is supported on $a^{*}$.

Recall that any G-optimal design $\pi_{\ell}$ maximizes $f(\pi)=\log\det V(\pi)$, where $V(\pi)=\sum_{a\in\mathcal{A}_{\ell}}\pi(a)\nu_{a}\nu_{a}^{\top}$, over $\pi\in\mathcal{P}(\mathcal{A}_{\ell})$ (Lattimore & Szepesvári, 2020, Theorem 21.1). When $\mathcal{A}_{\ell}=\mathcal{A}$, $\det V(\pi)$ can be computed as

$$\det V(\pi)=\left(\pi(a_{1})\pi(a_{2})+\frac{\pi(a^{*})}{4}(\pi(a_{1})+\pi(a_{2}))\right)\pi(a_{3})\cdots\pi(a_{|\mathcal{Z}|}).$$

Any maximizing $\pi$ must have $\pi(a_{1})=\pi(a_{2})$, and from this it follows that $\pi(a^{*})=0$: for a fixed total mass $s=\pi(a_{1})+\pi(a_{2})+\pi(a^{*})$ with $\pi(a_{1})=\pi(a_{2})=p$, the first factor equals $p\,s/2$, which is maximized at $p=s/2$. Moreover, there is only one G-optimal design in this case, namely $\pi_{\ell}=\mathrm{Unif}(a_{1},\dots,a_{|\mathcal{Z}|})$.
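The determinant formula and the location of the maximizer can be checked by brute force in a small case. The sketch below assumes $|\mathcal{Z}|=3$ and searches designs on a grid of step $1/12$ over the four arms $a_{1},a_{2},a_{3},a^{*}$, using exact fractions; the helper names and grid resolution are illustrative choices, and a grid search illustrates the claim without proving it.

```python
from fractions import Fraction
from itertools import product


def design_matrix(weights, vectors):
    """V(pi) = sum_a pi(a) * nu_a nu_a^T, with exact fractions."""
    n = len(vectors[0])
    V = [[Fraction(0)] * n for _ in range(n)]
    for w, v in zip(weights, vectors):
        for r in range(n):
            for c in range(n):
                V[r][c] += w * v[r] * v[c]
    return V


def det(M):
    """Determinant by exact Gaussian elimination with row swaps."""
    M = [row[:] for row in M]
    n, result = len(M), Fraction(1)
    for k in range(n):
        pivot = next((r for r in range(k, n) if M[r][k] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != k:
            M[k], M[pivot] = M[pivot], M[k]
            result = -result
        result *= M[k][k]
        for r in range(k + 1, n):
            factor = M[r][k] / M[k][k]
            for c in range(k, n):
                M[r][c] -= factor * M[k][c]
    return result


def closed_form(p_star, p):
    """Right-hand side of the determinant formula in the text."""
    tail = Fraction(1)
    for w in p[2:]:
        tail *= w
    return (p[0] * p[1] + p_star / 4 * (p[0] + p[1])) * tail


n, steps = 3, 12
basis = [[Fraction(int(r == c)) for c in range(n)] for r in range(n)]
vectors = basis + [[Fraction(1, 2), Fraction(1, 2), Fraction(0)]]  # last is nu_{a*}

best_value, best_designs, formula_ok = Fraction(-1), [], True
for k in product(range(steps + 1), repeat=n):
    if sum(k) > steps:
        continue
    p = [Fraction(x, steps) for x in k]
    p_star = 1 - sum(p)  # remaining mass goes to a*
    value = det(design_matrix(p + [p_star], vectors))
    formula_ok = formula_ok and value == closed_form(p_star, p)
    if value > best_value:
        best_value, best_designs = value, [(p_star, p)]
    elif value == best_value:
        best_designs.append((p_star, p))

print(formula_ok, best_value, best_designs)
```

On this grid, the only maximizer puts mass $1/3$ on each of $a_{1},a_{2},a_{3}$ and none on $a^{*}$, in line with the uniqueness claim.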

All actions other than $a_{|\mathcal{Z}|}$ would be eliminated at the same time.

If the first elimination happens at the end of phase $\ell$, then we must have $\hat{\mu}^{\mathcal{Z}}_{\ell}=(0,\dots,0,1-\Delta)^{\top}$, because $\pi_{\ell}=\mathrm{Unif}(a_{1},\dots,a_{|\mathcal{Z}|})$ and rewards are deterministic. So $\max_{b\in\mathcal{A}_{\ell}}\langle\hat{\mu}^{\mathcal{Z}}_{\ell},\nu_{b}-\nu_{a}\rangle$ equals $1-\Delta$ for all $a\neq a_{|\mathcal{Z}|}$ and $0$ for $a=a_{|\mathcal{Z}|}$. Hence the elimination must happen among $a^{*},a_{1},\dots,a_{|\mathcal{Z}|-1}$, and since these arms all share the same value of the elimination statistic, every one of them is eliminated simultaneously. ∎
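The elimination statistic in this last step can be evaluated directly on the constructed instance. The sketch below assumes $|\mathcal{Z}|=4$ and $\Delta=1/10$, and computes $\max_{b}\langle\hat{\mu}^{\mathcal{Z}}_{\ell},\nu_{b}-\nu_{a}\rangle$ for every arm; the function name `elimination_scores` is illustrative, and the threshold used by Algorithm 2 is deliberately left out because it is not restated in this section.

```python
from fractions import Fraction


def elimination_scores(marginals, mu_hat):
    """For each arm a, return max_b <mu_hat, nu_b - nu_a> over all active arms b."""
    values = {
        arm: sum(p * m for p, m in zip(nu, mu_hat))
        for arm, nu in marginals.items()
    }
    best = max(values.values())
    return {arm: best - v for arm, v in values.items()}


n, delta = 4, Fraction(1, 10)

# Marginals nu_{a_i} = e_i and nu_{a*} = (e_1 + e_2) / 2, as in the construction.
marginals = {
    f"a_{i + 1}": [Fraction(int(j == i)) for j in range(n)] for i in range(n)
}
marginals["a_star"] = [Fraction(1, 2), Fraction(1, 2)] + [Fraction(0)] * (n - 2)

# Estimate obtained from the uniform design with deterministic rewards.
mu_hat = [Fraction(0)] * (n - 1) + [1 - delta]

scores = elimination_scores(marginals, mu_hat)
print(scores)
```

Every arm other than the last one receives the same score $1-\Delta$, so any threshold that removes one of them removes all of them, including $a^{*}$.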