Safe Linear Bandits over Unknown Polytopes

Aditya Gangrade1,2, Tianrui Chen1, Venkatesh Saligrama1
1Boston University, 2University of Michigan
{gangrade, trchen, srv}@bu.edu
Abstract

The safe linear bandit problem (SLB) is an online approach to linear programming with unknown objective and unknown roundwise constraints, under stochastic bandit feedback of rewards and safety risks of actions. We study the tradeoffs between efficacy and smooth safety costs of SLBs over polytopes, and the role of aggressive doubly-optimistic play in avoiding the strong assumptions made by extant pessimistic-optimistic approaches.

We first elucidate an inherent hardness in SLBs due to the lack of knowledge of constraints: there exist ‘easy’ instances, for which suboptimal extreme points have large ‘gaps’, but on which SLB methods must still incur $\Omega(\sqrt{T})$ regret or safety violations, due to an inability to resolve unknown optima to arbitrary precision. We then analyse a natural doubly-optimistic strategy for the safe linear bandit problem, doss, which uses optimistic estimates of both reward and safety risks to select actions, and show that despite the lack of knowledge of constraints or feasible points, doss simultaneously obtains tight instance-dependent $O(\log^2 T)$ bounds on efficacy regret, and $\widetilde{O}(\sqrt{T})$ bounds on safety violations, thus attaining near Pareto-optimality. Further, when safety is demanded to a finite precision, violations improve to $O(\log^2 T)$. These results rely on a novel dual analysis of linear bandits: we argue that doss proceeds by activating noisy versions of at least $d$ constraints in each round, which allows us to separately analyse rounds where a ‘poor’ set of constraints is activated, and rounds where ‘good’ sets of constraints are activated. The costs in the former are controlled to $O(\log^2 T)$ by developing new dual notions of gaps, based on global sensitivity analyses of linear programs, that quantify the suboptimality of each such set of constraints. The latter costs are controlled to $O(1)$ by explicitly analysing the solutions of optimistic play.

1 Introduction

The Safe Linear Bandit (SLB) problem: Consider a linear program $\max \theta^\top x : Ax \le \alpha$ where the feasible set $\mathcal{S} := \{Ax \le \alpha\}$ is known to be a nonempty bounded polytope in $\mathbb{R}^d$, but neither the objective $\theta \in \mathbb{R}^d$ nor the constraint matrix $A \in \mathbb{R}^{m \times d}$ is completely known a priori, and no action known a priori to be safe (i.e., feasible) is available. Instead, a learner sequentially picks actions $x_t$, with the goal of choosing $x_t$ that are effective and safe in each round.
Learning is enabled through stochastic bandit feedback in the form of a reward signal $R_t = \langle \theta, x_t \rangle + w_t^R$ and a risk signal $S_t = A x_t + w_t^S$, where $(w_t^R, w_t^S)$ is a noise process.

Ideally, we would explore only in $\mathcal{S}$, but since we do not know it (or any safe point) to start with, some safety violation must necessarily occur over the course of learning. It is natural in many applications to penalise such violation ‘softly’. With this view, we measure the performance of the learner over $T$ rounds through the efficacy regret, $\mathscr{E}_T$, and the net safety violation, $\mathscr{S}_T$, defined as

$$\mathscr{E}_T := \sum_{t \le T} \big(\langle \theta, x^* - x_t \rangle\big)_+ \quad \text{and} \quad \mathscr{S}_T := \sum_{t \le T} \big(\max_i\, (\langle a^i, x_t \rangle - \alpha^i)\big)_+, \tag{1}$$

where $(z)_+ := \max(z, 0)$, and $x^*$ is the constrained optimum. These same $\ell_1$ metrics were proposed in the finite-armed setting by Efroni et al. (2020) and Chen et al. (2022). The main structural property that makes $\mathscr{E}_T, \mathscr{S}_T$ pertinent in the roundwise scenario is that they accumulate only the positive parts of the roundwise inefficiency or safety violation. Indeed, since $\mathscr{E}_T$ sums over $(\langle \theta, x^* - x_t \rangle)_+$, playing any $x_t$ with better reward than $x^*$ leads to no decrease in it, and instead increases $\mathscr{S}_T$, since such an $x_t$ must be infeasible. Conversely, since $\mathscr{S}_T$ sums the largest roundwise violations, playing a safe but under-effective $x_t$ increases $\mathscr{E}_T$ but does not reduce $\mathscr{S}_T$.
Thus, the only way to make both $\mathscr{E}_T$ and $\mathscr{S}_T$ small is to ensure that most $x_t$s are near-safe and near-optimal. We note that the choice of the linear penalty on violations above is just out of convenience: any penalty of the form $f(\max_i (\langle a^i, x_t \rangle - \alpha^i)_+)$, where $f$ smoothly decays to $0$ near $0^+$, is amenable to our analysis (§G).
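To make the two metrics in (1) concrete, the following is a minimal sketch that evaluates them on a toy instance. The instance, the helper names (`dot`, `positive_part`, `efficacy_regret`, `safety_violation`) and the two hand-picked actions are illustrative assumptions, not part of the paper; the sketch also assumes full knowledge of $\theta$, $A$ and $x^*$, which a learner would not have.

```python
def positive_part(z):
    """Return (z)_+ = max(z, 0)."""
    return max(z, 0.0)


def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def efficacy_regret(theta, x_star, actions):
    """E_T: sum over rounds of the positive part of the reward shortfall."""
    best = dot(theta, x_star)
    return sum(positive_part(best - dot(theta, x)) for x in actions)


def safety_violation(A, alpha, actions):
    """S_T: sum over rounds of the positive part of the worst constraint excess."""
    return sum(
        positive_part(max(dot(a, x) - al for a, al in zip(A, alpha)))
        for x in actions
    )


# Hypothetical toy instance: maximise x1 + x2 subject to x1 <= 1 and x2 <= 1.
theta = [1.0, 1.0]
A = [[1.0, 0.0], [0.0, 1.0]]
alpha = [1.0, 1.0]
x_star = [1.0, 1.0]

# One over-aggressive (infeasible) action and one over-cautious (feasible) one.
actions = [[1.5, 1.0], [0.5, 1.0]]
E_T = efficacy_regret(theta, x_star, actions)
S_T = safety_violation(A, alpha, actions)
```

On this toy input the infeasible action contributes only to `S_T` and the cautious action only to `E_T`; neither contribution offsets the other, which is the structural property described above.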

Motivating Examples.

The interplay of unknown rewards and constraints is a common feature of application domains of bandits. In drug trials, one needs to balance the efficacy of a treatment regimen with its risk of various side-effects (i.e., the probabilities that it induces harmful side-effects); in crowdsourcing, one must balance the cost of completing tasks with the quality of the resulting work; and recommender systems must balance the click-rate of suggestions with their effects on engagement (such as watch-time or revisiting rates). In such cases, we must enforce the constraint in each round, e.g., completing one task well does not license us to be sloppy on the next. Further, it is nontrivial to find a feasible starting point, since, e.g., this requires knowing worker quality distributions a priori, or knowing which compounds balance the side-effects of active compounds a priori. Nevertheless, soft enforcement is meaningful, e.g., if the risk of a side-effect is slightly over $\alpha$, this only leads to a slight increase in the overall number of adverse effects realised; and a slight reduction in the mean watch-time is an acceptable price for learning. Thus strong control on $\mathscr{S}_T$ ensures that in the long run, the system performs arbitrarily close to safety.

Soft Roundwise Enforcement over Polytopes.

We focus on understanding what performance can be achieved while ensuring that $\mathscr{S}_T = o(T)$. At first glance, one expects control of the form $\max(\mathscr{E}_T, \mathscr{S}_T) = \widetilde{O}(\sqrt{T})$, which indeed follows from standard techniques (§4). However, this question is most interesting in a refined sense: since we work over a polytopal domain,¹ prior work on linear bandits tells us that if $\mathcal{S}$ were known, one can derive instance-dependent bounds of $O(\log^2 T)$ on $\mathscr{E}_T, \mathscr{S}_T$ (e.g. Abbasi-Yadkori et al., 2011). This paper is concerned with the question

Over polytopal domains, is it possible to attain instance-dependent polylogarithmic bounds on $\mathscr{E}_T$ and $\mathscr{S}_T$ without knowing $\mathcal{S}$ in advance?

¹ While obvious, let us explicitly note here that the problem over polytopal domains is of significant importance, since this corresponds to the ubiquitous questions of linear programming.

Our Contributions approach this by studying the efficacy-safety tradeoffs for SLBs, and by analysing a natural doubly-optimistic method for the same. Concretely, we show that

  • Simultaneous logarithmic control on $\mathscr{E}_T$ and $\mathscr{S}_T$ is impossible. We show that for any SLB algorithm, there exists an instance with large ‘gaps’ on which the algorithm incurs $\max(\mathscr{E}_T, \mathscr{S}_T) = \Omega(\sqrt{T})$. The key property of these instances is the large, i.e., $\Omega(1)$, gap: due to this gap, each instance could be solved with $\max(\mathscr{E}_T, \mathscr{S}_T) = O(\log^2 T)$ if the feasible set $\mathcal{S}$ were known (§5). However, a polynomial lower bound arises since the lack of knowledge of $\mathcal{S}$ induces a ‘precision barrier,’ that is, the fact that no method can locate effective and safe actions to precision better than $t^{-1/2}$ after $t$ rounds of play. This same barrier also renders the standard primal approach of analysing polytopal linear bandits via their extreme points ineffective for SLBs (Fig. 1, left). We further note that the constructed instances are simple enough to embed into any nontrivial set of SLB instances, making the result generic rather than specific to the particular situation we study.

  • Nevertheless, doubly-optimistic (DO) methods can attain $\mathscr{E}_T = O(\log^2 T)$ and $\mathscr{S}_T = \widetilde{O}(\sqrt{T})$. Specifically, we show that these bounds are attained by the DO method doss (§3.2), which generalises the finite-armed approach of Efroni et al. (2020) and Chen et al. (2022), and has been studied for aggregate enforcement (see below) by Agrawal and Devanur (2014). doss builds an ‘optimistic’ estimate $\widetilde{\mathcal{S}}_t$ of $\mathcal{S}$, and selects actions optimistically over the same. Since these bounds match our lower bounds up to polylog factors, doss is near-Pareto-optimal for SLBs.

  • The aforementioned precision barrier is the sole obstruction to logarithmic bounds. We argue that in important special cases, doss, with either no or mild changes, attains $\max(\mathscr{E}_T, \mathscr{S}_T) = O(\log^2 T)$. The key property of such settings is an innate way to avoid having to identify good primal actions to arbitrary precision, illustrating that this is the key obstruction in SLBs.

Figure 1: The challenge, and our approach. Left. The usual primal view of linear bandits over polytopes breaks down, since noisy estimates of the unknown $A$ induce a continuum of potential locations for extreme points (red blobs). Right. Taking a dual linear programming view, we can identify extreme points as arising by saturating $d$ independent constraints. We generalise this view by showing that doss plays by saturating noisy versions of $d$ constraints. Poor play can arise from picking the wrong set of constraints (blue), or using a poor estimate for the right set of constraints (red).
Technical novelty of the paper lies in the analysis of doss. Since the primal approach for obtaining polylog regret in linear bandits fails, we instead approach the problem through a novel dual analysis that exploits the fact that extreme points of polytopes can be dually viewed as points saturating $d$ constraints (Fig. 1, right). We show that this view generalises, i.e., doss picks actions by saturating a noisy version of $d$ constraints. This allows us to break the analysis into two threads:
a) a combinatorial identification problem of whether the ‘right’ set of constraints is saturated, and
b) whether effective points are played when the ‘optimal’ sets of constraints are saturated.

The efficacy loss due to the former is controlled to $O(\log^2 T)$ by developing a novel notion of ‘dual gap’ associated with each ‘poor’ set of constraints, which arises via a global LP sensitivity analysis approach. The second issue is handled via a careful analysis of optimistic play to argue that under mild nondegeneracy assumptions, it cannot play ineffective actions when saturating the ‘optimal’ set of constraints, which controls the efficacy loss due to such play to $O(1)$.
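The dual view of extreme points used above can be illustrated in the noiseless case with a known constraint matrix. The sketch below, for $d = 2$, enumerates every pair of constraints, solves the $2 \times 2$ system obtained by saturating both, and keeps the feasible solutions; each vertex is thereby labelled by the index set it saturates. The instance and all function names are hypothetical, and this is not doss itself, which works with noisy versions of such constraints.

```python
from itertools import combinations


def solve_2x2(a1, a2, b1, b2):
    """Solve a1.x = b1, a2.x = b2 by Cramer's rule; return None if singular."""
    det = a1[0] * a2[1] - a1[1] * a2[0]
    if abs(det) < 1e-12:
        return None
    x = (b1 * a2[1] - a1[1] * b2) / det
    y = (a1[0] * b2 - b1 * a2[0]) / det
    return (x, y)


def extreme_points(A, alpha, tol=1e-9):
    """Map each saturated index pair (i, j) to the vertex it defines, if feasible."""
    vertices = {}
    for i, j in combinations(range(len(A)), 2):
        p = solve_2x2(A[i], A[j], alpha[i], alpha[j])
        if p is None:
            continue  # parallel constraints define no point
        if all(a[0] * p[0] + a[1] * p[1] <= al + tol for a, al in zip(A, alpha)):
            vertices[(i, j)] = p
    return vertices


# Hypothetical instance: the unit box [0, 1]^2 cut by x + y <= 1.5.
A = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]]
alpha = [1.0, 1.0, 0.0, 0.0, 1.5]
theta = (1.0, 2.0)

verts = extreme_points(A, alpha)
# The optimal vertex, identified by the pair of constraints it saturates.
best_pair = max(verts, key=lambda k: theta[0] * verts[k][0] + theta[1] * verts[k][1])
```

In this toy instance the optimum is the vertex saturating the constraints with indices 1 and 4 ($y \le 1$ and $x + y \le 1.5$). Identifying that index pair is a finite, combinatorial question, whereas locating the vertex itself depends continuously on the entries of $A$.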

1.1 Related Work

We briefly describe the two main lines of work on constrained bandits (also see §A).

Hard Roundwise Enforcement.

Instead of the soft sense we study, one can demand roundwise enforcement in a hard sense, requiring that with high probability (whp), the constraints always be met, i.e., whp, $\mathscr{S}_t = 0$. Since this is clearly not possible without knowing a safe point to start with, methods along these lines usually assume a priori knowledge of a point $x^{\mathsf{s}}$ in the interior of $\mathcal{S}$, i.e., with positive safety margin $M^{\mathsf{s}} := -\max_i (\langle a^i, x^{\mathsf{s}} \rangle - \alpha^i)$. Given the knowledge of $(x^{\mathsf{s}}, M^{\mathsf{s}})$, recent lines of work (Amani et al., 2019; Moradipari et al., 2021; Pacchiano et al., 2021; Afsharrad et al., 2023; Hutchinson et al., 2023; Varma et al., 2023; Pacchiano et al., 2024) have proposed various ‘pessimistic-optimistic’ (PO) methods for the SLB problem,² which operate by exploring in the vicinity of $x^{\mathsf{s}}$, and build pessimistic estimates of $\mathcal{S}$, over which they act optimistically.
While such methods attain the strong safety guarantee of $\mathscr{S}_T = 0$ whp, the associated costs are significant: (i) the knowledge of $(x^{\mathsf{s}}, M^{\mathsf{s}})$ is nontrivial to obtain, and the costs of obtaining the same are not accounted for in this literature,³ and (ii) the resulting efficacy bounds, $\mathscr{E}_T = O(d\sqrt{T}/M^{\mathsf{s}})$, quantitatively depend on this safety margin.⁴

² And also safe MDPs (e.g. Turchetta et al., 2016; Wachi and Sui, 2020; Bernasconi et al., 2022; Vaswani et al., 2022).

³ Note that the need for a safety margin may make even seemingly simple settings challenging. E.g., if $x$ is the amount of different drugs assigned to a treatment, one may think that the ‘no-treatment’ drug cocktail $x = 0$ is always ‘safe’, and can serve as $x^{\mathsf{s}}$. However, in treatment regimens, it is common that any dose of compound 1 must be accompanied by a proportional dose of compound 2 to manage the side-effects induced by compound 1, i.e., the constraint may be of the form $\langle (a_1, -a_2), x \rangle \le 0$, in which case $x = 0$ has no safety margin, and so is unusable for PO methods.

⁴ We also include a simulation study in §8 that indicates that the safety violations of doss are considerably better behaved than the efficacy costs of the PO method safe-LTS (Moradipari et al., 2021).

Aggregate Enforcement.

Instead of roundwise metrics, aggregate constraint enforcement aims to control $\mathscr{R}_T = \sum_{t \le T} \langle \theta, x^* - x_t \rangle$ and $\mathscr{A}_T = \sum_{t \le T} \max_i (\langle a^i, x_t \rangle - \alpha^i)$ (e.g. Badanidiyuru et al., 2013, 2014; Agrawal and Devanur, 2014, 2016; Agrawal et al., 2016). The key difference from the roundwise setting is that there is no nonlinearity in the roundwise penalties in $\mathscr{R}_T, \mathscr{A}_T$. This small change drastically affects allowable behaviour for such problems, e.g., we can ensure $\mathscr{A}_T = o(T)$ while alternating between playing ‘very unsafe’ and ‘very safe’ actions, since the negative costs of the latter cancel the positive costs of the former, but this would instead incur $\mathscr{S}_T = \Omega(T)$.
Of course, $\mathscr{A}_T$ is an inappropriate metric for safety contexts, e.g., treating one patient unsafely cannot be balanced by assigning a placebo to the next. We note that while the analysis of Agrawal and Devanur (2014) can be extended to show $(\mathscr{E}_T, \mathscr{S}_T) = \widetilde{O}(\sqrt{T})$, we go much beyond this basic observation through our finer-grained upper bounds of $(\log^2 T, \widetilde{O}(\sqrt{T}))$, as well as our instance-wise obstructions, which are both novel. We also note that most of the literature on aggregate enforcement explicitly assumes that $x = 0$ is safe, and that the entries of $A$ are positive, which we do not need. Aggregate enforcement remains an active area of research, e.g., ‘Conservative bandits’ (e.g. Wu et al., 2016) enforce properties of the form $\mathscr{A}_t = O(\sqrt{t})$ for most $t$, and Liu et al. (2021) show that given a Slater parameter, one can enforce $\mathscr{A}_t \le 0$ for all $t$ large enough. We also note that most work on constrained MDPs is of this flavour (e.g. Vaswani et al., 2022, and references therein).
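The cancellation that separates the aggregate metric from the roundwise one can be checked numerically. The sketch below assumes a hypothetical learner whose worst-case constraint excess $\max_i (\langle a^i, x_t \rangle - \alpha^i)$ alternates between $+0.5$ and $-0.5$; the function names and the numbers are illustrative only.

```python
def aggregate_violation(excesses):
    """A_T: signed excesses are summed, so safe rounds offset unsafe ones."""
    return sum(excesses)


def roundwise_violation(excesses):
    """S_T: only the positive part of each round's excess is accumulated."""
    return sum(max(e, 0.0) for e in excesses)


T = 1000
# Hypothetical play: 'very unsafe' on even rounds, 'very safe' on odd rounds.
excesses = [0.5 if t % 2 == 0 else -0.5 for t in range(T)]

A_T = aggregate_violation(excesses)
S_T = roundwise_violation(excesses)
```

Here `A_T` is zero while `S_T` grows as $T/4$, i.e., linearly in $T$, matching the contrast drawn above.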

2 Problem Setup

For naturals $a \le b$, let $[a:b] := \{a, \dots, b\}$. We let $\langle \cdot, \cdot \rangle$ and $\|\cdot\|$ denote the inner product and $\ell_2$-norm in $\mathbb{R}^d$ respectively, and for a matrix $V \succ 0$, $\|z\|_V := \sqrt{\langle z, Vz \rangle}$. For a $p \times q$ matrix $M$ and a set $\mathsf{S} \subset [1:p]$, $M(\mathsf{S})$ denotes the $|\mathsf{S}| \times q$ submatrix of $M$ preserving rows indexed in $\mathsf{S}$.

Setting.

An instance of the polytopal SLB problem is parameterised by a polytopal region $\mathcal{X} = \{Bx \le \beta\} \subset \mathbb{R}^d$, a known constraint level vector $\alpha \in \mathbb{R}^U$, and a latent objective $\theta \in \mathbb{R}^d$ and constraint matrix $A \in \mathbb{R}^{U \times d}$, which define the principal LP of relevance. Here, the constraints $\{Bx \le \beta\}$ should be thought of as arising from pre-determined hard limits on $x$.⁵ For notational succinctness, we will embed these constraints into $(A, \alpha)$ by extending $A$ to lie in $\mathbb{R}^{m \times d}$ for $m = U + K$, setting the last $K$ rows of $A$ to $B$, and similarly augmenting $\alpha$ to include $\beta$. We shall often need the notation $\mathbf{1}_U = (1, \cdots, 1, 0, \cdots, 0)$, with $U$ ones, which indicates the unknown constraints. With this notation, the principal LP of interest is $\max_{x \in \mathcal{X}} \langle \theta, x \rangle : Ax \le \alpha$.

⁵ E.g., known box constraints in crowdsourcing account for maximum worker capacity, and nonnegativity of work.
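The embedding of the known constraints into $(A, \alpha)$ amounts to stacking rows. A minimal sketch follows, assuming plain Python lists for matrices; the function name `embed_known_constraints` and the numerical instance are hypothetical, and $K$ is taken to be the number of rows of $B$.

```python
def embed_known_constraints(A_unknown, alpha_unknown, B, beta):
    """Stack the U unknown rows above the K known rows of (B, beta).

    Returns the augmented (A, alpha) with m = U + K rows, together with the
    indicator vector 1_U marking which rows are unknown to the learner.
    """
    U, K = len(A_unknown), len(B)
    A = [list(row) for row in A_unknown] + [list(row) for row in B]
    alpha = list(alpha_unknown) + list(beta)
    indicator = [1] * U + [0] * K
    return A, alpha, indicator


# Hypothetical instance with d = 2, U = 1 unknown constraint, and a known box
# 0 <= x1, x2 <= 0.7 (K = 4 rows), chosen so that the box lies in the unit ball.
A_unknown = [[0.6, 0.8]]
alpha_unknown = [0.5]
B = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
beta = [0.7, 0.7, 0.0, 0.0]

A, alpha, one_U = embed_known_constraints(A_unknown, alpha_unknown, B, beta)
```

Only the rows flagged by `one_U` need to be estimated from feedback; the remaining rows are known exactly.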

Play.

The problem proceeds in rounds, indexed by $t$. For each $t$, we choose an $x_t \in \mathcal{X}$, and receive reward feedback $r_t$ and safety feedback $\{s_t^i\}_{i \in [1:U]}$ that satisfy $r_t = \langle \theta, x_t \rangle + w_t^R$ and $s_t^i = \langle a^i, x_t \rangle + w_t^{S,i}$, where the various $w_t$s are each subGaussian noise processes, which need not be independent across $i$.
The information set of the learner at time $t$ is $\mathscr{H}_{t-1} := \{(x_\tau, r_\tau, \{s_\tau^i\}_{i \in [1:U]})_{\tau < t}\}$, and $x_t$ must be adapted to the filtration induced by $\mathscr{H}_{t-1}$.
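The interaction protocol can be sketched as a short simulation loop. The following assumes independent standard Gaussian noise (one admissible choice of subGaussian noise, made here purely for illustration) and a placeholder policy that ignores the history; the names `feedback`, `run` and `fixed_policy` are hypothetical.

```python
import random


def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def feedback(theta, A_unknown, x, rng, noise_scale=1.0):
    """Return (r_t, [s_t^1, ..., s_t^U]) for action x under Gaussian noise."""
    r = dot(theta, x) + noise_scale * rng.gauss(0.0, 1.0)
    s = [dot(a, x) + noise_scale * rng.gauss(0.0, 1.0) for a in A_unknown]
    return r, s


def run(theta, A_unknown, policy, T, seed=0, noise_scale=1.0):
    """Play T rounds; the policy sees only the record of earlier rounds."""
    rng = random.Random(seed)
    history = []  # plays the role of the information set H_{t-1}
    for _ in range(T):
        x = policy(history)  # x_t depends on the past only
        r, s = feedback(theta, A_unknown, x, rng, noise_scale)
        history.append((x, r, s))
    return history


# Hypothetical instance and a placeholder policy that always plays one point.
theta = [0.6, 0.8]
A_unknown = [[1.0, 0.0]]
fixed_policy = lambda history: [0.5, 0.5]

history = run(theta, A_unknown, fixed_policy, T=5)
noiseless = run(theta, A_unknown, fixed_policy, T=1, noise_scale=0.0)
```

Any actual SLB method would replace `fixed_policy` with a rule that uses `history` to build estimates of $\theta$ and $A$; feedback is generated only for the $U$ unknown constraints, as in the setup above.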

Metrics.

We will control the Efficacy Regret and Net Safety Violation (1). We reiterate that these have pertinence to the SLB setting because they penalise only the positive parts of roundwise costs.

Assumptions.

We conclude by noting standard assumptions due to Abbasi-Yadkori et al. (2011).

  1. Boundedness: $\|\theta\| \le 1$, $\|a^i\| \le 1$ for all $i$, and $\mathcal{X} \subset \{\|x\| \le 1\}$ is a bounded polytope.

  2. SubGaussian Noise: for all $t$, $w_t := (w_t^R, \{w_t^{S,i}\}_{i \in [1:U]})$ is conditionally centred and $1$-subGaussian given $\mathcal{F}_t := \sigma(\mathscr{H}_{t-1}, x_t)$, i.e., $\forall t$, $\mathbb{E}[w_t \mid \mathcal{F}_t] = 0$, and $\forall \lambda$, $\mathbb{E}[\exp(\lambda^\top w_t) \mid \mathcal{F}_t] \le \exp(\|\lambda\|^2/2)$.

All subsequent results should be taken to hold only under the above. See §B.1 for more details.
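To make the feedback model concrete, the following is a minimal simulator sketch under stated assumptions. The class name `SafeLinearBanditEnv` and its interface are hypothetical and not from the paper, and the noise is taken to be independent standard Gaussian, which is one admissible choice of conditionally centred, $1$-subGaussian noise (the assumptions above do not require independence across $i$).

```python
import numpy as np


class SafeLinearBanditEnv:
    """Toy simulator of r_t = <theta, x_t> + w^R and s_t^i = <a^i, x_t> + w^{S,i}.

    Hypothetical helper for illustration only; not part of the paper.
    """

    def __init__(self, theta, A_unknown, seed=0):
        self.theta = np.asarray(theta, dtype=float)  # reward parameter
        # Rows are the U unknown constraint vectors a^i.
        self.A = np.atleast_2d(np.asarray(A_unknown, dtype=float))
        self.rng = np.random.default_rng(seed)

    def step(self, x):
        """Play action x and return noisy reward and safety feedback."""
        x = np.asarray(x, dtype=float)
        # Assumption: i.i.d. standard Gaussian noise on the U + 1 coordinates.
        noise = self.rng.standard_normal(self.A.shape[0] + 1)
        reward = float(self.theta @ x + noise[0])
        safety = self.A @ x + noise[1:]
        return reward, safety
```

A learner only ever sees the pair returned by `step`; the parameters `theta` and `A` stay hidden inside the environment.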

3 A Doubly Optimistic Algorithm for Safe Linear Bandits

As previously discussed, our main method of interest is the natural approach of playing optimistically from an optimistic permissible set (Agrawal and Devanur, 2014; Efroni et al., 2020; Chen et al., 2022). We summarise the method, and establish key notation that is used throughout.

3.1 Confidence Sets and Noise Scales

We take the standard approach (Abbasi-Yadkori et al., 2011). Let the matrix $X_{1:t} = [x_1, \dots, x_t]^\top$ and the vectors $R_{1:t} = [r_1, \dots, r_t]^\top$, $S^i_{1:t} = [s^i_1, \dots, s^i_t]^\top$ arise by stacking the actions and feedback. The 1-regularised least squares (RLS) estimate of $\theta, a^i$ using $\mathscr{H}_{t-1}$ is

$$\hat{\theta}_t = (X_{1:t}^\top X_{1:t} + \lambda I)^{-1} X_{1:t}^\top R_{1:t}, \quad \hat{a}_t^i = (X_{1:t}^\top X_{1:t} + \lambda I)^{-1} X_{1:t}^\top S_{1:t}^i.$$

Of course, if $i \in [U+1:m]$, then we do not need to estimate $a^i$, and we shall just set $\hat{a}_t^i = a^i$ instead. We will collate the $\hat{a}^i_t$s into a matrix $\hat{A}_t$ row-wise. Let us define the signal strength as $V_t := \sum_{s \le t} x_s x_s^\top + I$, and for $\delta \in (0,1)$, the $m$-confidence radius as $\sqrt{\omega_t(\delta)} = 1 + \sqrt{\frac{1}{2} \log \frac{(U+1)\sqrt{\det V_{t-1}}}{\delta}}$. The main results are based on the following two concepts, which we explicitly delineate.

Definition 3.1.

For any time $t$, the RLS confidence sets are

$$\mathcal{C}_t^\theta(\delta) := \{\tilde{\theta} : \|\tilde{\theta} - \hat{\theta}_t\|_{V_{t-1}} \le \sqrt{\omega_t(\delta)}\} \text{ and } \boldsymbol{\mathcal{C}}_t(\delta) := \{\tilde{A} : \forall \text{ rows } i, \|\tilde{a}^i - \hat{a}_t^i\|_{V_{t-1}} \le \sqrt{\omega_t(\delta)} \mathbf{1}_U\},$$

and the local noise scale is $\rho_t(x;\delta) := 2\sqrt{\omega_t(\delta)} \|x\|_{V_{t-1}^{-1}}$.
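The quantities in this subsection can be computed directly from the stacked data. The following is a minimal sketch, assuming $\lambda = 1$ so that the regularised Gram matrix coincides with the signal strength $V$; the function names `rls_estimates`, `sqrt_omega` and `noise_scale` are hypothetical and chosen here only for illustration.

```python
import numpy as np


def rls_estimates(X, R, S, lam=1.0):
    """Regularised least squares for theta and the unknown constraint rows.

    X: (n, d) past actions, R: (n,) rewards, S: (n, U) safety feedback.
    Returns theta_hat (d,), A_hat (U, d) and V = X^T X + lam * I.
    """
    d = X.shape[1]
    V = X.T @ X + lam * np.eye(d)
    theta_hat = np.linalg.solve(V, X.T @ R)
    A_hat = np.linalg.solve(V, X.T @ S).T  # row i is a_hat^i
    return theta_hat, A_hat, V


def sqrt_omega(V, U, delta):
    """Confidence radius 1 + sqrt(0.5 * log((U + 1) * sqrt(det V) / delta))."""
    _, logdet = np.linalg.slogdet(V)  # log-determinant, stable for large V
    return 1.0 + np.sqrt(0.5 * (np.log(U + 1) + 0.5 * logdet - np.log(delta)))


def noise_scale(x, V, U, delta):
    """Local noise scale rho(x) = 2 * sqrt(omega) * ||x||_{V^{-1}}."""
    return 2.0 * sqrt_omega(V, U, delta) * np.sqrt(x @ np.linalg.solve(V, x))
```

Using `slogdet` rather than `det` is a numerical design choice: the determinant of $V$ grows polynomially in $t$, and only its logarithm enters the radius.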

The key properties we need are due to Abbasi-Yadkori et al. (2011), and are summarised below and proved in §B.2. We will often drop the dependence of $\mathcal{C}_t^\theta(\delta)$, $\boldsymbol{\mathcal{C}}_t(\delta)$, and $\rho_t(x;\delta)$ on $\delta$.

Lemma 3.2.

The confidence sets are consistent, i.e., $\mathbb{P}\left(\forall t, \theta \in \mathcal{C}_t^\theta(\delta), A \in \boldsymbol{\mathcal{C}}_t(\delta)\right) \ge 1 - \delta$. Further, under consistency, the noise scale $\rho_t(x;\delta)$ satisfies, for all $x \in \mathcal{X}$,

$$\forall \tilde{A} \in \boldsymbol{\mathcal{C}}_t(\delta),\ |(\tilde{A} - A)x| \le \rho_t(x;\delta) \mathbf{1}_U, \quad \text{and} \quad \forall \tilde{\theta} \in \mathcal{C}_t^\theta(\delta),\ |\langle \tilde{\theta} - \theta, x \rangle| \le \rho_t(x;\delta).$$

Finally, for any sequence $\{x_t\}$, $\sum_{s \le t} \rho_s(x_s)^2 = O(d^2 \log^2 t)$ and $\sum_{s \le t} \rho_s(x_s) = \widetilde{O}(\sqrt{d^2 t})$.
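One standard way to see that the second bound follows from the first is the Cauchy-Schwarz inequality. The step below is a short worked sketch in the notation of Lemma 3.2, and is not claimed to be the argument used in §B.2.

```latex
\sum_{s \le t} \rho_s(x_s)
  \le \sqrt{t \sum_{s \le t} \rho_s(x_s)^2}
  = \sqrt{t \cdot O(d^2 \log^2 t)}
  = \widetilde{O}\bigl(\sqrt{d^2 t}\bigr).
```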

3.2 Doubly-Optimistic Safe Selection

We describe the method, doss (Algorithm 1). The key construction herein is the optimistic permissible set of points $x$ that are safe according to at least one choice of constraints in $\boldsymbol{\mathcal{C}}_t$:

$$\widetilde{\mathcal{S}}_t(\delta) := \{x : \exists \tilde{A} \in \boldsymbol{\mathcal{C}}_t(\delta) \text{ s.t. } \tilde{A}x \le \alpha\}. \tag{2}$$

Algorithm 1 Doubly-Optimistic Safe Selection (doss)
  Input: $\delta \in (0,1)$
  for $t = 1, 2, \dots$ do
     Construct $\widetilde{\mathcal{S}}_t(\delta)$ as in (2).
     Optimize (3) and play $x_t$.
     Observe $r_{t,x_t}$, $\{s^i_{t,x_t}\}$.
     Update $X, R, \{S^i\}, V, C$.
  end for

The set $\widetilde{\mathcal{S}}_t$ consists of all actions that may plausibly be safe given $\mathscr{H}_t$. The arm $x_t$ is selected optimistically from $\widetilde{\mathcal{S}}_t$ as

$$(\tilde{\theta}_t, x_t) \in \operatorname*{arg\,max} \{\langle \tilde{\theta}, x \rangle : \tilde{\theta} \in \mathcal{C}_t^\theta(\delta), x \in \widetilde{\mathcal{S}}_t(\delta)\}. \tag{3}$$

The optimistic construction of the permissible set is the main distinction between the DO approach and the PO approaches (§1). The latter instead work with the pessimistic set $\Pi_t := \{x : \forall \tilde{A} \in \boldsymbol{\mathcal{C}}_t, \tilde{A}x \le \alpha\}$, which satisfies $\Pi_t \subset \mathcal{S}$ whp. In contrast, $\widetilde{\mathcal{S}}_t(\delta) \supset \mathcal{S}$ whp. Of course, since the known constraints in $A$ are enforced, $\widetilde{\mathcal{S}}_t(\delta) \subset \mathcal{X}$.
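For a fixed action $x$, both optimistic quantities have closed forms, because the confidence sets are ellipsoids: the largest plausible reward is $\langle \hat{\theta}_t, x \rangle + \sqrt{\omega_t}\|x\|_{V_{t-1}^{-1}}$, and $x$ lies in the optimistic permissible set exactly when $\langle \hat{a}_t^i, x \rangle - \sqrt{\omega_t}\|x\|_{V_{t-1}^{-1}} \le \alpha^i$ for every unknown row $i$. The sketch below uses this to carry out one round of doubly-optimistic selection over a finite list of candidate actions. The restriction to finitely many candidates is an assumption made here for illustration (the joint problem (3) is bilinear in general), and the name `doss_step` is hypothetical; the candidates are assumed to already satisfy the known constraints.

```python
import numpy as np


def doss_step(candidates, theta_hat, A_hat, alpha, V, radius):
    """One doubly-optimistic selection over finitely many candidates.

    candidates: (k, d) actions in X; A_hat: (U, d) estimated unknown rows;
    alpha: (U,) safety levels; radius: the value of sqrt(omega_t(delta)).
    Returns the index of the selected candidate.
    """
    Vinv = np.linalg.inv(V)
    # ||x||_{V^{-1}} for every candidate x.
    widths = np.sqrt(np.einsum("kd,de,ke->k", candidates, Vinv, candidates))
    bonus = radius * widths
    # Optimistic safety: the lowest plausible risk must not exceed alpha.
    lower_risk = candidates @ A_hat.T - bonus[:, None]
    plausibly_safe = np.all(lower_risk <= alpha[None, :], axis=1)
    if not plausibly_safe.any():
        raise ValueError("no candidate is plausibly safe")
    # Optimistic reward, restricted to the plausibly safe candidates.
    ucb = np.where(plausibly_safe, candidates @ theta_hat + bonus, -np.inf)
    return int(np.argmax(ucb))
```

With little data (small $V$) the bonus is large, so nearly every candidate is plausibly safe and the rule explores aggressively; as $V$ grows, the permissible set shrinks towards the true safe set.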

4 Warm Up: Polynomial Bounds on Regret and Safety Cost, and Going Beyond

An immediate application of the approach of Abbasi-Yadkori et al. (2011) yields the following basic result, establishing that doss is a reasonable procedure.

Theorem 4.1.

The actions $\{x_t\}$ of doss($\delta$) yield, whp, $\mathscr{E}_T = \widetilde{O}(\sqrt{d^2 T})$, and $\mathscr{S}_T = \widetilde{O}(\sqrt{d^2 T})$.

Proof Sketch. By Lemma 3.2, $\forall t, \theta \in \mathcal{C}_t^\theta, A \in \boldsymbol{\mathcal{C}}_t$ whp, and so $x^* \in \widetilde{\mathcal{S}}_t(\delta)$ whp. Thus, (3) ensures $\langle \tilde{\theta}_t, x_t \rangle \ge \langle \theta, x^* \rangle$. But, by the noise-scale characterisation in Lemma 3.2, $\langle \tilde{\theta}_t, x_t \rangle \le \langle \theta, x_t \rangle + \rho_t(x_t)$, and so $\langle \theta, x^* - x_t \rangle \le \rho_t(x_t)$.
On the other hand, since $x_t \in \widetilde{\mathcal{S}}_t$, there exists some $\tilde{A} \in \boldsymbol{\mathcal{C}}_t$ such that $\tilde{A} x_t \le \alpha$. But again $\alpha \ge \tilde{A} x_t \ge A x_t - \rho_t(x_t) \mathbf{1}_U$, and so $\max_i (\langle a^i, x_t \rangle - \alpha^i)_+ \le \rho_t(x_t)$. Consequently, $\mathscr{E}_T \le \sum_{t \le T} \rho_t(x_t)$ and $\mathscr{S}_T \le \sum_{t \le T} \rho_t(x_t)$, and the bound follows from Lemma 3.2.

Polytopes to Break Through $\sqrt{T}$?

The above result in fact holds over any convex domain without change. However, our domain of interest is linear programming, i.e., $\mathcal{S}$ and $\mathcal{X}$ are polytopes, and thus is much more structured. Indeed, for linear bandits with known $\mathcal{S}$, optimistic play yields instance-dependent logarithmic regret bounds for large $T$ (Abbasi-Yadkori et al., 2011). Such results rely on the observation that if $\mathcal{S}$ is known, any action that an optimistic method takes lies in the finite set of extreme points of $\mathcal{S}$. Therefore, $\exists \Delta > 0$ such that for any suboptimal $x_t$, $\langle \theta, x^* - x_t \rangle \ge \Delta$, which directly leads to regret bounds of $O(\log^2(T)/\Delta)$. (Footnote 6: the key trick is that $\mathscr{R}_T \le \sum \rho_t(x_t) \mathbf{1}\{\rho_t(x_t) \ge \Delta\} \le \sum \rho_t(x_t)^2/\Delta$.)

This raises the natural question: can we also attain logarithmic bounds on $(\mathscr{E}_T, \mathscr{S}_T)$ when some of the constraints are unknown? Answering this will occupy us for the remainder of this paper.

5 Impossibility of Simultaneous Logarithmic Bounds on Both Efficacy and Safety

The question we raised in §4 needs a little care to formulate: since we do not know $\mathcal{S}$, it is unreasonable to expect bounds that scale only with the optimality gap of actions, since unsafe points outside of $\mathcal{S}$ must also be eliminated. We can account for this by also considering the spurious extreme points induced by the bounding polytope $\mathcal{X}$, and consider

$$\mathcal{E} := \{\text{extreme points of } \mathcal{S}\} \cup \{\text{extreme points of } \mathcal{X}\}.$$

Now, $\mathcal{E}$ is again a finite set, and for any $x \in \mathcal{E} \setminus \{x^*\}$, either $x$ is feasible but suboptimal, in which case $\langle \theta, x^* - x \rangle > 0$, or it is infeasible, in which case $\max_i (\langle a^i, x \rangle - \alpha^i) > 0$. Let us say that an instance is $\Delta$-well separated if, for every $x \in \mathcal{E} \setminus \{x^*\}$, the relevant one of these two quantities is at least $\Delta$. Note that if we knew $\mathcal{E}$, it would be easy to obtain $O(\Delta^{-1} \log^2 T)$ bounds using the technique described in §4. The refined question of interest is: can we always attain logarithmic efficacy regret and safety violations for well-separated SLB instances? Surprisingly, the answer to this is negative, as we show in §F.1.

Theorem 5.1.

For every SLB algorithm, there exists a $1/8$-well-separated instance on which the algorithm must incur $\max(\mathbb{E}[\mathscr{E}_T], \mathbb{E}[\mathscr{S}_T]) = \Omega(\sqrt{T})$.

Figure 2: An obstruction to logarithmic bounds in safe linear bandits.

Proof Sketch. The obstruction is illustrated in Figure 2. We study the 1D problem $\max x$ under the known constraints $0 \le x \le 1$, reward parameter $\theta = 1$, and the unknown constraint $ax \le 1/4$. Consider the case $a \in \{(1 \pm \kappa)/2\}$ for $\kappa \le 1/4$. For these instances, $\mathcal{E} = \{0, 1, 1/(2(1 \pm \kappa))\}$, and the last point is optimal. Further, $0$ is at least $(2(1 \pm \kappa))^{-1} \ge 2/5$-inefficient, and $1$ violates the constraint by $(1 \pm 2\kappa)/4 \ge 1/8$, and so either instance is $1/8$-well-separated.

But, no matter the $x_t$s, we cannot estimate $a$ to error better than $1/\sqrt{t}$, and so we cannot eliminate either of $(1 \pm \kappa)/2$ if $t < 1/\kappa^2$. Now, if the truth were $a = (1 - \kappa)/2$, playing $x_t < 2/(1 - \kappa^2)$ incurs inefficacy $\ge 2\kappa$, and conversely if $a = (1 + \kappa)/2$, playing $x_t \ge 2/(1 - \kappa^2)$ violates safety by $2\kappa$.
Thus, at least one of $\mathscr{E}_T^{(1+\kappa)/2}$ and $\mathscr{S}_T^{(1-\kappa)/2}$ must be $\Omega(\kappa \cdot \min(T, \kappa^{-2}))$. The bound follows by choosing $\kappa = 1/\sqrt{T}$. $\Box$
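The two features of this construction, constant separation of the extreme points but optima that differ only at order $\kappa$, can be checked numerically. The snippet below is an illustrative sketch; the function name `instance_summary` is hypothetical, and the code only evaluates the quantities defined in the proof sketch rather than reproducing the lower-bound argument.

```python
def instance_summary(kappa):
    """Optimum and separation for max x s.t. 0 <= x <= 1, a * x <= 1/4,
    with a = (1 + sign * kappa) / 2 for sign in {+1, -1}."""
    out = {}
    for sign in (+1, -1):
        a = (1 + sign * kappa) / 2
        x_star = 1 / (4 * a)             # unknown constraint is tight at the optimum
        gap_at_0 = x_star - 0.0          # inefficiency of the extreme point 0
        violation_at_1 = a * 1.0 - 0.25  # safety violation of the extreme point 1
        out[sign] = (x_star, min(gap_at_0, violation_at_1))
    return out


summary = instance_summary(kappa=0.01)
# Distance between the optima of the two candidate instances.
optimum_shift = summary[-1][0] - summary[+1][0]
```

For any $\kappa \le 1/4$ the separation stays at least $1/8$, while the distance between the two optima is $\kappa/(1-\kappa^2)$, so telling the instances apart requires resolving $a$ to precision of order $\kappa$.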

Impossibility of instance-dependent simultaneous logarithmic bounds.

We highlight that the above lower bound scales as $\sqrt{T}$ despite constant order separation in the instance. This stands in sharp contrast to existing minimax lower bounds for standard bandits (e.g., Shamir, 2015), which set $\Delta \sim T^{-1/2}$ to show $\Omega(\sqrt{T})$ bounds. The barrier to logarithmic control in SLBs is more fundamental, and comes from an inability to refine the precise location of the optimal point, rather than because there are suboptimal points in the noiseless problem that have small gaps. In other words, the issue is one of precision rather than one of hardness in the underlying LP, and this makes it impossible to be both very efficient and very safe on all instances. We further observe that the construction is extremely simple, and thus can embed into essentially any class of instances (e.g., by revealing a line that the optimum lies on), and so this issue is pervasive, rather than limited to specific hard cases.

Nevertheless, the result does not preclude that one of $\mathscr{E}_T, \mathscr{S}_T$ is small. In fact, although they need the extra information $(x^{\mathsf{s}}, M^{\mathsf{s}})$, we can view PO schemes as saturating this bound, since they achieve $\mathscr{E}_T = \widetilde{O}(\sqrt{T}), \mathscr{S}_T = 0$. We shall show in the sequel that the DO method doss saturates the bound as well, attaining $\mathscr{E}_T = O(\log^2 T), \mathscr{S}_T = \widetilde{O}(\sqrt{T})$, without this extra information.

A dual view, and our approach.

From an analytic point of view, the failure to improve on $\sqrt{T}$ bounds can be seen as a breaking down of the assertion that in polytopal domains, optimistic methods play on the finite set of extreme points of the polytope. Indeed, in the SLB scenario, the polytope is not known, and these extreme points are effectively smeared out into sets of diameter $\Omega(t^{-1/2})$ due to estimation errors in $\hat{A}_t$. Thus, the primal approach to analysing polytopes breaks down.

As described in §1, our resolution to this issue lies in the dual view of extreme points of a polytope as points that activate exactly $d$ independent constraints. Due to this, we can view optimism with known $\mathcal{S}$ as activating some $d$ constraints of $\{Ax \le \alpha\}$. This view generalises: we show that there exists some $\tilde{A} \in \boldsymbol{\mathcal{C}}_t$ such that under doss, $x_t$ activates at least $d$ constraints of $\mathcal{S}_{\tilde{A}} := \{\tilde{A}x \le \alpha\}$. Naturally, such a set of constraints, indexed by $I$ say, is a ‘poor’ choice if saturating these constraints for $\{Ax \le \alpha\}$ yields poor or infeasible points. The key idea is that if $I$ is ‘poor’, then the only way doss would prefer to activate noisy versions of the constraints in $I$ is if the noise-scale $\rho_t(x_t)$ is large.

This sets up a two-step attack to control $\mathscr{E}_T$. First, we use the dual argument above to study a ‘combinatorial identification’ question of whether doss finds the ‘right’ set of constraints to saturate. This is addressed by developing new dual notions of gaps for sets of constraints, which arise by an approach reminiscent of the global sensitivity analysis of LPs (Bertsimas and Tsitsiklis, 1997, Ch. 5), and is the subject of §6. Secondly, even if the ‘right’ set of constraints is activated, doss may play ineffectively due to noisy estimation of $\tilde{A}$. Standard arguments (such as §4) only yield a $\sqrt{T}$ control on this. Instead, we show that due to the optimism of (3), if $t \ge d$ then activating any ‘optimal’ set of constraints yields $x_t : \langle \theta, x_t - x^* \rangle > 0$, which controls efficacy loss due to such play to $O(1)$. This argument is elementary, but involved, and entails a careful analysis of the structure of (3) when optimal constraints are activated, as developed in §7 and §E.1.

6 Structural Behaviour of doss, and Noise-Scale Lower Bounds

We proceed to formally define basic index sets, as well as the gaps associated with these index sets, which lead to the key consequence that doss does not play ‘suboptimal’ index sets too often.

6.1 Basic Index Sets

We begin by formalising ‘sets of constraints’, and ‘activation’ as mentioned in §5.

Definition 6.1.

An index set $I$ is a subset of $[1:m]$. Such a set $I$ is called a basic index set (BIS) if $|I| = d$. The set of points that activate an index set $I$ is defined as $\mathcal{X}^I := \{x \in \mathcal{S} : A(I)x = \alpha(I)\}$.

Notice that we demand that activating points are feasible, i.e., $\mathcal{X}^I \subset \mathcal{S}$. The set $\mathcal{X}^I$ may be empty, or a singleton, or an affine segment. We shall find the following linear-algebraic terminology useful.

Definition 6.2.

A BIS $I$ is called $\mathrm{(i)}$ feasible if $\mathcal{X}^I \neq \emptyset$ and infeasible otherwise; $\mathrm{(ii)}$ suboptimal if $x^* \notin \mathcal{X}^I$ and optimal otherwise; $\mathrm{(iii)}$ full rank if the row vectors of $A(I)$ span $\mathbb{R}^d$.

Figure 3: Illustration of Ex. 6.3. The black lines represent the known constraints, the red line is the unknown constraint, and the blue line is the locus of optimality.

Example 6.3.

To illustrate these definitions, consider the LP

$$\max\; x_1 + 2x_2 : \underbrace{x_2 \le 1/2}_{\textrm{unknown}},\quad \underbrace{x_1 \ge x_2,\; x_1 \le 1,\; x_2 \ge 0}_{\textrm{known}}.$$

Foregoing normalisation for clarity, we have $m = 4, U = 1$ and the parameters $\theta = (1,2), a^1 = (0,1), a^2 = (-1,1), a^3 = (1,0), a^4 = (0,-1), \alpha = (0.5, 0, 1, 0)$. There are $\binom{4}{2} = 6$ basic index sets,

$$\begin{aligned} I_1 &= \{1,2\}, & I_2 &= \{1,3\}, & I_3 &= \{1,4\}, \\ I_4 &= \{2,3\}, & I_5 &= \{2,4\}, & I_6 &= \{3,4\}. \end{aligned}$$

Of these, $I_2$ is optimal, and the rest suboptimal, with $x^* = (1, 1/2)$; $I_3$ is rank-deficient while the rest are full-rank; $I_3$ and $I_4$ are infeasible, while the rest are feasible.
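The classification in Ex. 6.3 can be checked mechanically. The following is a minimal sketch, not part of the original development: it uses exact rational arithmetic, solves each $2 \times 2$ system $A(I)x = \alpha(I)$ by Cramer's rule, and tests membership in $\mathcal{S}$. The helper names `in_S` and `classify` are hypothetical, and the rank-deficient branch only handles the inconsistent case, which is the only one arising in this example.

```python
from fractions import Fraction as F
from itertools import combinations

# Data of Ex. 6.3: constraints a^i . x <= alpha_i, indexed 1..4.
A = {1: (F(0), F(1)), 2: (F(-1), F(1)), 3: (F(1), F(0)), 4: (F(0), F(-1))}
alpha = {1: F(1, 2), 2: F(0), 3: F(1), 4: F(0)}
x_star = (F(1), F(1, 2))  # optimum stated in the text


def in_S(x):
    """True if x satisfies every constraint of the true polytope S."""
    return all(a[0] * x[0] + a[1] * x[1] <= alpha[i] for i, a in A.items())


def classify(I):
    """Return (full_rank, feasible, optimal) for a two-element index set I."""
    i, j = I
    (a, b), (c, d) = A[i], A[j]
    e, f = alpha[i], alpha[j]
    det = a * d - b * c
    if det == 0:
        # Parallel rows: A(I)x = alpha(I) is either inconsistent or a line.
        k = c / a if a != 0 else d / b
        if f != k * e:
            return (False, False, False)  # no solution, so X^I is empty
        raise NotImplementedError("consistent rank-deficient case not needed here")
    # Unique solution of the 2x2 system by Cramer's rule.
    x = ((e * d - b * f) / det, (a * f - e * c) / det)
    feasible = in_S(x)
    return (True, feasible, feasible and x == x_star)


# Classify all six basic index sets I_1, ..., I_6.
labels = {I: classify(I) for I in combinations(range(1, 5), 2)}
```

Here `labels[(1, 3)]` corresponds to $I_2$ and `labels[(1, 4)]` to $I_3$; each entry is the triple (full rank, feasible, optimal).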

Noisy Activation.

For SLBs, instead of the true constraint matrix $A$, doss must work with noisy estimates of it, the $\tilde{A}$s. We extend the notion of BIS activation to handle this fuzziness in constraints.

Definition 6.4.

The set of points that noisily activates a BIS $I$ at time $t$ is

$$\widetilde{\mathcal{X}}_t^I := \{x \in \widetilde{\mathcal{S}}_t : \exists \tilde{A} \in \boldsymbol{\mathcal{C}}_t,\ \tilde{A}(I)x = \alpha(I)\}.$$

Note that $\widetilde{\mathcal{X}}_t^I \subset \widetilde{\mathcal{S}}_t \subset \mathcal{X}$. The main structural result is the following observation.

Proposition 6.5.

The actions of doss must noisily activate at least one BIS, i.e. $\forall t, \exists I_t : x_t \in \widetilde{\mathcal{X}}_t^{I_t}$.

If $x_t$ noisily activates the BIS $I$ at time $t$, we shall say that $I$ is played at time $t$. Note that more than one BIS may be played at a time (since $x_t$ can lie in the intersection of many $\widetilde{\mathcal{X}}_t^I$s).

6.2 Gaps of Suboptimal BISs

We argue that if doss noisily activates a suboptimal BIS at $t$, then the noise scale $\rho_t(x_t;\delta)$ must be large. To show this, we develop two gaps for suboptimal BISs: the feasibility gap and the efficacy gap, which respectively exploit the permissibility and optimism of $x_t$. Our results will lower bound $\rho_t(x_t;\delta)$ by the larger of these gaps when suboptimal BISs are played. The overall constructions are essentially via a reduction to global linear programming sensitivity analysis. This is necessary: since we do not know the constraints in $A$ or $\theta$, perturbations in this matrix (as represented by $\tilde{A}$) may, and indeed do, cause the optimal $x^*$ to appear suboptimal.

The basic structure we use is the following localisation of $x_t$s played by doss, proved in §D.1 as a simple consequence of Lemma 3.2. From here onwards, we shall just write $\rho_t$ instead of $\rho_t(x_t;\delta)$.

Lemma 6.6.

For $\zeta \in [0,\infty)$, define the activation polytope of $I$ at scale $\zeta$ as

$$\mathcal{T}(\zeta;I) := \{x : Ax \le \alpha + \zeta\mathbf{1}_U,\ A(I)x \ge \alpha(I) - \zeta\mathbf{1}_U(I)\}.$$

If the confidence sets are consistent, and if the action of doss at time $t$, $x_t$, noisily activates the BIS $I$, then $x_t \in \mathcal{T}(\rho_t;I)$, and further, $\langle\theta, x^* - x_t\rangle \le \rho_t$.
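To make the activation polytope concrete, the sketch below tests membership in $\mathcal{T}(\zeta;I)$ for the data of Ex. 6.3. It assumes that $\mathbf{1}_U$ is the indicator vector of the unknown constraints (only the first constraint in that example), which is how the example's description of $\mathcal{T}(\rho_t;I_1)$ reads. The function name `in_activation_polytope` is hypothetical.

```python
from fractions import Fraction as F

# Data of Ex. 6.3: rows a^1..a^4 and levels alpha (constraints a^i . x <= alpha_i).
A = [(F(0), F(1)), (F(-1), F(1)), (F(1), F(0)), (F(0), F(-1))]
alpha = [F(1, 2), F(0), F(1), F(0)]
unknown = [1, 0, 0, 0]  # assumed form of 1_U: only constraint 1 is unknown


def in_activation_polytope(x, zeta, I):
    """Check x in T(zeta; I), where I is a set of 1-based constraint indices."""
    for idx, (a, level) in enumerate(zip(A, alpha)):
        value = a[0] * x[0] + a[1] * x[1]
        slack = zeta * unknown[idx]  # relaxation applies to unknown rows only
        if value > level + slack:    # A x <= alpha + zeta * 1_U
            return False
        if (idx + 1) in I and value < level - slack:  # A(I) x >= alpha(I) - zeta * 1_U(I)
            return False
    return True
```

For instance, the point $(3/5, 3/5)$ belongs to $\mathcal{T}(1/10; I_1)$ but not to $\mathcal{T}(1/20; I_1)$, since it overshoots the unknown constraint $x_2 \le 1/2$ by exactly $1/10$.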

6.2.1 Intuitive Illustration of Gaps

Figure 4: Illustration of gaps in Ex. 6.3. $x^{I_1}$ is the purple dot, and the activation polytope $\mathcal{T}(\rho_t;I_1)$ is shown in purple, along with the separation $\gamma(I_1)$. The spread $\mathfrak{s}(I_1)$ is the inner product of the direction in which $\mathcal{T}$ varies and $\theta$. For $I_4$, the feasibility gap $\zeta_*(I)$ is illustrated geometrically in orange.

To expose the key components that allow doss to control the play of suboptimal BISs, we will first consider the feasible, full-rank, and suboptimal BIS $I_1 = \{1,2\}$ in Ex. 6.3. Since such an $I$ is full rank, its constraints are activated by a unique point, $x^I$. Since $I$ is suboptimal, there is a positive ‘efficacy separation’ between $x^I$ and $x^*$, denoted $\gamma(I) := \langle\theta, x^* - x^I\rangle$. In our example, $x^{I_1} = (1/2, 1/2)$, and $\gamma(I_1) = 1/2$.

Efficacy Gap.

Under noisy activation of $I$, the point $x_t$ may depart from $x^I$, but it cannot go too far. Indeed, by Lemma 6.6, $x_t$ must lie in the activation polytope $\mathcal{T}(\rho_t;I)$, which is a skewed $\ell_\infty$-box of scale $\rho_t$ containing $x^I$. In our example, $\mathcal{T}(\rho_t;I_1) = \{x : x_1 = x_2,\ x_2 \in 1/2 \pm \rho_t\}$. This localisation constrains how large $\langle\theta, x_t\rangle$ can be. Indeed, there exists a constant $\mathfrak{s}(I)$, which we call the spread of $I$, such that $\max_{x \in \mathcal{T}(\zeta;I)} \langle\theta, x - x^I\rangle \le \zeta\mathfrak{s}(I)$. In effect, $\mathfrak{s}(I)$ is a measure of how well the geometry induced by $I$ near $x^I$ aligns with $\theta$, e.g., for $I_1$, $\mathfrak{s}(I_1)$ is the inner product between $\theta$ and $(1,1)$, the direction along which $\mathcal{T}$ varies.

Thus, $\langle\theta, x_t\rangle \le \langle\theta, x^I\rangle + \rho_t\mathfrak{s}(I)$. But, since $\langle\theta, x^I - x^*\rangle = -\gamma(I)$, this implies $\langle\theta, x_t\rangle \le \langle\theta, x^*\rangle - \gamma(I) + \rho_t\mathfrak{s}(I)$. This lies in tension with Lemma 6.6, which states that $\langle\theta, x_t\rangle \ge \langle\theta, x^*\rangle - \rho_t$. Resolving this tension yields the lower bound $\rho_t \ge \eta_*(I) := \gamma(I)/(1+\mathfrak{s}(I))$. We call the constant $\eta_*(I)$ the efficacy gap of $I$. For Ex. 6.3, $\eta_*(I_1) = 1/8$.
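The tension described above can be verified numerically for $I_1$. The sketch below, in exact rational arithmetic, recomputes $\gamma(I_1)$, $\mathfrak{s}(I_1)$ and $\eta_*(I_1)$ from the data of Ex. 6.3, and checks at which noise scales the upper and lower bounds on $\langle\theta, x_t\rangle$ are compatible. The function `compatible` is a hypothetical helper introduced only for this illustration.

```python
from fractions import Fraction as F

theta = (F(1), F(2))
x_star = (F(1), F(1, 2))     # optimum of Ex. 6.3
x_I1 = (F(1, 2), F(1, 2))    # unique point activating I_1 = {1, 2}


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


gamma = dot(theta, x_star) - dot(theta, x_I1)   # efficacy separation of I_1
spread = dot(theta, (F(1), F(1)))               # theta against the direction (1, 1)
eta = gamma / (1 + spread)                      # efficacy gap of I_1


def compatible(rho):
    """True if the two bounds on <theta, x_t> can hold at noise scale rho."""
    upper = dot(theta, x_star) - gamma + rho * spread  # from localisation in T(rho; I_1)
    lower = dot(theta, x_star) - rho                   # from Lemma 6.6
    return lower <= upper
```

With these values, `compatible(rho)` holds exactly when `rho >= eta`, matching the bound $\rho_t \ge \eta_*(I_1)$.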

Safety Gap.

It is also possible that $x_t$ noisily activates an infeasible BIS, such as $I_4 = \{2,3\}$ in Ex. 6.3. In this case, a conflict arises between the inequalities defining the activation polytope $\mathcal{T}(\zeta;I)$: if $I$ is infeasible, then $\mathcal{T}(0;I) = \mathcal{X}^I = \{Ax \le \alpha,\ A(I)x \ge \alpha(I)\} = \emptyset$, and by right-continuity $\mathcal{T}(\zeta;I)$ is empty for small $\zeta$. Let us define $\zeta_*(I)$ to be the smallest scale at which $\mathcal{T}(\zeta;I)$ is nonempty. Since $x_t \in \mathcal{T}(\rho_t;I)$, it follows that if $x_t$ activates a BIS $I$, then $\rho_t \ge \zeta_*(I)$. We call $\zeta_*(I)$ the safety gap of $I$. In Ex. 6.3, $\mathcal{T}(\zeta;I_4) = \{x : x_1 = 1,\ x_1 = x_2,\ x_2 \le 1/2 + \zeta\}$, and so $\zeta_*(I_4) = 1/2$.

Summary.

The above illustrates two basic tensions in selecting suboptimal BISs. If a BIS $I$ is infeasible, then activating it requires that $\rho_t$ dominates its safety gap, and if $I$ is feasible but suboptimal, then activation requires that $\rho_t$ exceeds its efficacy gap. We formalise this concept below.

6.2.2 Formal Definitions of the Gaps

We give a unified treatment of the safety and efficacy gaps by analysing a parameterised LP with feasible set determined by the local structure induced by a BIS $I$, as encapsulated in Lemma 6.6.

Definition 6.7.

For a BIS $I$, and $\zeta \ge 0$, the optimistic LP at scale $\zeta$ induced by $I$ is defined as $P(\zeta;I) := \sup\{\langle\theta, x\rangle : x \in \mathcal{T}(\zeta;I)\}$, with the convention that $\sup\emptyset = -\infty$.

Since by Lemma 6.6, $x_t$ lies in $\mathcal{T}(\rho_t;I)$ if it noisily activates $I$, this yields $\langle\theta, x_t\rangle \le P(\rho_t;I)$. So, the behaviour of $P(\zeta;I)$ with $\zeta$ lets us capture the tensions we illustrated in the previous section.
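As an illustration of Definition 6.7, the sketch below evaluates $P(\zeta;I)$ for the two-dimensional instance of Ex. 6.3 by brute-force vertex enumeration in exact arithmetic. This is a stand-in for a general LP solver and relies on $\mathcal{T}(\zeta;I)$ being bounded, which holds in this example; it is not the procedure used in the paper. It again assumes that $\mathbf{1}_U$ is the indicator of the single unknown constraint, and the name `optimistic_lp` is hypothetical.

```python
from fractions import Fraction as F
from itertools import combinations

theta = (F(1), F(2))
A = [(F(0), F(1)), (F(-1), F(1)), (F(1), F(0)), (F(0), F(-1))]
alpha = [F(1, 2), F(0), F(1), F(0)]
unknown = [1, 0, 0, 0]  # assumed form of 1_U: only constraint 1 is unknown


def optimistic_lp(zeta, I):
    """Value P(zeta; I) over T(zeta; I); returns -inf if the polytope is empty."""
    # Collect all inequalities as pairs (g, h) meaning g . x <= h.
    ineqs = []
    for idx, (a, level) in enumerate(zip(A, alpha)):
        slack = zeta * unknown[idx]
        ineqs.append((a, level + slack))
        if idx + 1 in I:  # lower side: a . x >= level - slack
            ineqs.append(((-a[0], -a[1]), -(level - slack)))
    best = float("-inf")
    # A bounded nonempty polytope attains its maximum at a vertex, and each
    # vertex is the intersection of two non-parallel constraint lines.
    for (a, e), (c, f) in combinations(ineqs, 2):
        det = a[0] * c[1] - a[1] * c[0]
        if det == 0:
            continue
        x = ((e * c[1] - a[1] * f) / det, (a[0] * f - e * c[0]) / det)
        if all(g[0] * x[0] + g[1] * x[1] <= h for g, h in ineqs):
            best = max(best, theta[0] * x[0] + theta[1] * x[1])
    return best
```

On this instance, `optimistic_lp(zeta, {1, 2})` grows linearly in `zeta` near zero, while `optimistic_lp(zeta, {2, 3})` stays at $-\infty$ until `zeta` reaches $1/2$, which mirrors the two tensions discussed in §6.2.1.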

Definition 6.8.

We define the feasibility gap of a BIS $I$ as

$$\zeta_*(I) := \inf\{\zeta \ge 0 : P(\zeta;I) > -\infty\}.$$

We define the efficacy separation of $I$ as $\gamma(I) := \langle\theta, x^*\rangle - P(\zeta_*(I);I)$, and the spread of $I$ as $\mathfrak{s}(I) := \inf\{C : \forall\zeta \ge \zeta_*(I),\ P(\zeta;I) \le P(\zeta_*(I);I) + C(\zeta - \zeta_*(I))\}$, which yield the efficacy gap of $I$,

$$\eta_*(I) = \frac{\gamma(I) + \zeta_*(I)\mathfrak{s}(I)}{1 + \mathfrak{s}(I)}.$$

The definitions above concretise the quantities described in §6.2.1. The key consequence of these definitions is the following ‘noise-scale lower bound on activating poor BISs,’ shown in §D.2.

Lemma 6.9.

For any suboptimal BIS $I$, $\max(\zeta_*(I), \eta_*(I)) > 0$. Further, under consistency of the confidence sets, if $x_t$ activates a suboptimal BIS $I$, then $\rho_t(x_t;\delta) \ge \max(\zeta_*(I), \eta_*(I))$.
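The bound of Lemma 6.9 is straightforward to evaluate once the three quantities of Definition 6.8 are known. The sketch below does so for $I_1$ and $I_4$ of Ex. 6.3. The values for $I_1$ and the feasibility gap of $I_4$ are those given in §6.2.1. The values $\gamma(I_4) = -1$ and $\mathfrak{s}(I_4) = 0$ are worked out by hand for this illustration, from the observation that $\mathcal{T}(\zeta;I_4)$ is the single point $(1,1)$ for $\zeta \ge 1/2$, and should be read as assumptions rather than values stated in the paper. Function names are hypothetical.

```python
from fractions import Fraction as F


def efficacy_gap(zeta_star, gamma, spread):
    """Efficacy gap eta_*(I) as in Definition 6.8."""
    return (gamma + zeta_star * spread) / (1 + spread)


def noise_scale_lower_bound(zeta_star, gamma, spread):
    """Lower bound on rho_t from Lemma 6.9: the larger of the two gaps."""
    return max(zeta_star, efficacy_gap(zeta_star, gamma, spread))


# I_1 = {1, 2}: feasible, so zeta_* = 0; gamma = 1/2 and spread = 3 as in Sec. 6.2.1.
lb_I1 = noise_scale_lower_bound(F(0), F(1, 2), F(3))

# I_4 = {2, 3}: zeta_* = 1/2 from Sec. 6.2.1; gamma = -1 and spread = 0 are
# hand-derived assumptions (P(zeta; I_4) = 3 for all zeta >= 1/2).
lb_I4 = noise_scale_lower_bound(F(1, 2), F(-1), F(0))
```

Under these inputs the efficacy gap drives the bound for the feasible set $I_1$, while the feasibility gap drives it for the infeasible set $I_4$.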

Note here that the noise-scale needed to play $I$ is driven by the larger of the efficacy and safety gap at $I$. This is natural: these quantities measure the ‘extent’ of infeasibility or inefficacy of $I$, and thus the larger one determines the rate at which evidence of the suboptimality of $I$ is accumulated.

6.3 Gap of the Problem, and Controlling the Play of Suboptimal BISs

In light of Lemma 6.9, the following is natural.

Definition 6.10.

The gap of an SLB instance is defined as $\Gamma := \min_I \max(\zeta_*(I), \eta_*(I))$.

The main result of this section shows that $\Gamma^{-2}$ bounds how often suboptimal BISs are played.

Theorem 6.11.

Let $\{x_t\}$ denote the actions of doss($\delta$) on an SLB instance. Then, with probability at least $1-\delta$, if at any time $t$, $x_t$ noisily activates a suboptimal BIS, then $\rho_t(x_t;\delta) > \Gamma$. Further, the total number of times suboptimal BISs are played is bounded as

$$\sum_t \mathds{1}\{\exists \textrm{ suboptimal BIS } I : x_t \in \widetilde{\mathcal{X}}_t^I\} = O\left(\Gamma^{-2}\left(d^2\log^2 T + d\log(T)\log(U/\delta)\right)\right).$$

This result, shown in §D.3, implies that most of the time, doss plays actions such that the noisy constraints they activate are precisely those that xsuperscript𝑥x^{*}italic_x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT saturates. In other words, while the method may not be able to locate xsuperscript𝑥x^{*}italic_x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT itself with precision better than O(1/t),𝑂1𝑡O(1/\sqrt{t}),italic_O ( 1 / square-root start_ARG italic_t end_ARG ) , it can identify the binding constraints, and, most of the time, the actions of doss focus on activating these constraints.
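To get a feel for the scaling in Theorem 6.11, the following sketch evaluates the expression inside the $O(\cdot)$ with the hidden constant set to $1$, which is an assumption made purely for illustration. The function name is hypothetical; $\Gamma$, $T$ and $\delta$ are the values reported in §8, while $d=2$ and $U=2$ are assumed.

```python
import math

def suboptimal_play_bound(gamma, d, T, U, delta):
    """Gamma^-2 (d^2 log^2 T + d log T log(U/delta)), with the O(.) constant taken as 1."""
    log_T = math.log(T)
    return gamma ** -2 * (d ** 2 * log_T ** 2 + d * log_T * math.log(U / delta))

# d and U are assumed values; the hidden constant is not known from the statement.
bound = suboptimal_play_bound(gamma=1 / 8, d=2, T=10 ** 4, U=2, delta=2.5e-5)
halved = suboptimal_play_bound(gamma=1 / 16, d=2, T=10 ** 4, U=2, delta=2.5e-5)
print(bound, halved / bound)
```

Halving the gap quadruples the expression, reflecting the $\Gamma^{-2}$ dependence, whereas the dependence on $T$ is only polylogarithmic.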

7 Controlling Efficacy Regret and Total Safety Violation

We now come to the main results of the paper. The previous section tells us that suboptimal BISs cannot be played too often, effectively controlling a ‘dual’ type of regret. We proceed to translate these results into bounds on the ‘primal’ quantities $\mathscr{E}_T$ and $\mathscr{S}_T$. This requires us to account for the times when only optimal BISs ($I$ such that $x^*\in\mathcal{X}^I$) are played. We can control the behaviour at such times under the following weak nondegeneracy condition at the optimum.

Assumption 7.1.

Every optimal BIS (i.e., $I : x^*\in\mathcal{X}^I$) is full-rank. Further, the noise $w_t^S$ is generic in the sense that the probability that it lies in any subspace of less than $d$ dimensions is zero.

Note that the condition does not require the uniqueness of the optimum. Instead, nondegeneracy is demanded in the sense that any size-$d$ subset of the constraints that $x^*$ saturates constitutes a full-rank BIS. The main effect of this is to exclude pathologies, such as the case in $\mathbb{R}^2$ where two identical constraints are placed on the system, and both pass through the optimum (i.e., $(a^i,\alpha^i)=c(a^j,\alpha^j)$ for some pair $i,j$). Notice that in standard linear programming, such constraints would be eliminated during pre-processing, which we cannot do since we do not know all of the constraints. Nevertheless, since the constraints represent limitations on different safety scores, it is unlikely in practice that these would be linearly dependent. Further, note that Assumption 7.1 allows $x^*$ to be degenerate in the sense that it may lie on many more than $d$ constraints. Of course, the genericity of noise is a standard condition, and can be met by adding an arbitrarily small continuous noise to the feedback. The main utility of this assumption is the following result, which is argued in §E.1.
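The first part of the assumption can be read as a rank check over index sets. The sketch below is illustrative only: it assumes the constraints $Ax\leq\alpha$ and the optimum $x^*$ are known (which the learner does not have access to), that a BIS is a size-$d$ set of constraint indices, and the helper name and tolerance are hypothetical.

```python
from itertools import combinations

import numpy as np

def optimal_bis_ranks(A, alpha, x_star, tol=1e-9):
    """For every size-d subset of the constraints saturated at x_star, report whether it is full-rank."""
    A = np.asarray(A, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    d = A.shape[1]
    # Indices i with <a^i, x_star> = alpha^i, up to numerical tolerance.
    saturated = np.flatnonzero(np.abs(A @ x_star - alpha) <= tol)
    return {I: bool(np.linalg.matrix_rank(A[list(I)]) == d)
            for I in combinations(saturated.tolist(), d)}

# Toy instance in R^2: constraint 2 is a rescaled copy of constraint 0.
A = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
alpha = [0.5, 0.5, 1.0]
x_star = np.array([0.5, 0.5])
ranks = optimal_bis_ranks(A, alpha, x_star)
print(ranks)
```

In this toy instance all three constraints are saturated at `x_star`, and the index set pairing a constraint with its rescaled copy fails the check, so the assumption does not hold here.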

Lemma 7.2.

Under Assumption 7.1, if the confidence sets are consistent, $t\geq d+1$, and the action $x_t$ of doss($\delta$) noisily activates only optimal BISs, then $\langle\theta,x_t\rangle\geq\langle\theta,x^*\rangle$.

In other words, when only the optimal BISs are played, the action $x_t$ cannot be ineffective! The proof relies on using the optimal BIS $I$ to construct a ‘localised’ program that the solutions $(\tilde{\theta}_t,x_t)$ and witness $\tilde{A}_t$ of (3) must also optimise. The assumption is used to make this part effective, and in general the same holds if $\theta\in\textrm{row-span}(A(I))$. The final statement then follows through an elementary, but involved, analysis of the structure of optimal solutions of this localised program.

Coupling the above with Theorem 6.11 yields our main result, shown in §E.2.

Theorem 7.3.

Under Assumption 7.1, w.p. $\geq 1-\delta$, the actions of doss($\delta$) yield

$$\mathscr{E}_T = O\left(\Gamma^{-1}\left(d^2\log^2 T + d\log T\log(U/\delta)\right)\right), \textit{ and } \mathscr{S}_T = \widetilde{O}\left(\sqrt{d^2 T\left(\log^2 T + \log T\log(U/\delta)\right)}\right).$$

In light of Theorem 5.1, we see that up to polylog factors, doss saturates the lower bound, with a bias towards minimising the efficacy regret. While the gain in efficacy performance over PO methods is evident, we again stress the advantage in terms of the lack of prior knowledge of a safe ball in $\mathcal{X}$. We further note that the costs scale logarithmically with the number of unknown constraints, $U$.

Tightness of Dependence on $\Gamma$.

Exploiting a subtle reduction of safe multi-armed bandit problems to SLB problems, we show in §F.2 that the inverse dependence on $\Gamma$ is necessary.

Theorem 7.4.

Fix a $c\in(0,1)$. For any $\Gamma\leq\nicefrac{1}{16}$, and any method that ensures that in every SLB instance, $\max(\mathscr{E}_T,\mathscr{S}_T)=O(T^{1-c})$, there exists an instance of the SLB problem with gap at least $\Gamma$, such that $\liminf\frac{\max(\mathbb{E}[\mathscr{E}_T],\mathbb{E}[\mathscr{S}_T])}{\log T}\geq\nicefrac{c}{108}\cdot\Gamma^{-1}$.

7.1 Improved Safety Performance Under Tolerance

While Theorem 7.3 is tight in terms of $\mathscr{S}_T$, given that it achieves polylogarithmic $\mathscr{E}_T$, the polynomial dependence can nevertheless be considered prohibitive. To improve upon this, we study three concrete scenarios in which this dependence may be improved. At the core, each of these cases relaxes the SLB problem so that the precision barrier discussed in §5 does not arise, thus illustrating that this condition is the sole obstruction to polylogarithmic control on $\mathscr{S}_T$.

Finite Precision Slack in Constraint Levels.

As a first pass, we may allow for a finite amount of violation of constraints without any penalty, e.g., through the $\varepsilon$-precision metric $\mathscr{S}_T^{\varepsilon}:=\sum_{t\leq T}\max_i(\langle a^i,x_t\rangle-\alpha^i)_+\,\mathds{1}\{\exists i:\langle a^i,x_t\rangle-\alpha^i>\varepsilon\}$. Such a relaxation is quite pertinent in scenarios such as drug trials or engineering design applications (where $\varepsilon$ can be set to a small factor of $\alpha^i$) or if the $\alpha^i$ are estimated values (where $\varepsilon$ can be the error level in these estimates). The latter situation is quite common: process and measurement variations mean that an exact threshold for the quality of components necessary to ensure safe behaviour is not known, and must usually be fixed empirically. In this context, we show in §E.3.1 that

Theorem 7.5.

With probability at least $1-\delta$, doss($\delta$) ensures that simultaneously for every $\varepsilon>0$

$$\mathscr{E}_T = O\left(\Gamma^{-1}d^2\log^2 T\right)\quad\textit{ and }\quad\mathscr{S}_T^{\varepsilon} = O\left(\varepsilon^{-1}d^2\log^2 T\right).$$

The main point of interest in the result above is that it holds simultaneously for every value of $\varepsilon$. Indeed, doss does not need $\varepsilon$ as a parameter, and it only arises in the analysis. This means that the method adapts to the precision requirements of the domain at hand. Note further that setting $\varepsilon=T^{-c}$ for $c>\nicefrac{1}{2}$ yields $\sum_{t\leq T}(\max_i\langle a^i,x_t\rangle-\alpha^i-T^{-c})_+=\widetilde{O}(T^{1-c})$, i.e., as $T\nearrow\infty$, doss rapidly converges towards feasibility, and gains over Theorem 7.3 are realised with decaying precision slack.
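The $\varepsilon$-precision metric can be computed from a trajectory of actions as in the sketch below. It assumes that the true constraints $(A,\alpha)$ are available for evaluation, as they would be in a simulation, and that the arbitrary-precision violation $\mathscr{S}_T$ is the same sum without the indicator; the function and variable names are hypothetical.

```python
import numpy as np

def safety_violations(A, alpha, actions, eps):
    """Return (S_T, S_T^eps) for a sequence of actions under constraints A x <= alpha."""
    A = np.asarray(A, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    X = np.asarray(actions, dtype=float)
    margins = X @ A.T - alpha             # entry (t, i) is <a^i, x_t> - alpha^i
    worst = margins.max(axis=1)           # largest constraint excess in round t
    per_round = np.maximum(worst, 0.0)    # positive part of the worst excess
    s_total = per_round.sum()
    s_eps = per_round[worst > eps].sum()  # only rounds where some excess is above eps
    return s_total, s_eps

A = [[1.0, 0.0], [0.0, 1.0]]
alpha = [0.5, 0.5]
actions = [[0.52, 0.5], [0.6, 0.4], [0.4, 0.4]]
s_total, s_eps = safety_violations(A, alpha, actions, eps=0.05)
print(s_total, s_eps)
```

In this toy trajectory the first action exceeds a constraint by about $0.02$, which is counted by $\mathscr{S}_T$ but not by $\mathscr{S}_T^{0.05}$, while the second action exceeds by about $0.1$ and is counted by both. Note that `eps` enters only the evaluation, mirroring the fact that doss itself does not take $\varepsilon$ as a parameter.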

Finite Precision in Constraint Parameters.

Rather than treating the precision in the constraint levels, it may be possible that the constraint parameters are restricted to a finite grid. Generically, such a structure arises in settings modeled as integer programs (up to a unit factor), and particular examples include drug discovery (e.g. Radhakrishnan and Tidor, 2008), where constraints indicate requirements that a compound binds to certain receptors, and so are naturally binary. We can formalise this by specifying a finite set $\mathsf{P}$ which describes the ‘grid’ that $A$ must lie in. Naturally, we can modify doss to exploit this by restricting the construction of $\widetilde{\mathcal{S}}_t(\delta)$ in (2) to $\tilde{A}\in\boldsymbol{\mathcal{C}}_t^{\mathsf{P}}=\boldsymbol{\mathcal{C}}_t\cap\mathsf{P}$. We argue in §E.3.2 that this change implicitly introduces a finite set of possible actions when only optimal BISs are activated, which in turn yields the following result.

Theorem 7.6.

If the constraint parameters lie in a finite precision set, then there exists a constant $\pi>0$ such that w.p. $\geq 1-\delta$, the actions of doss($\delta$) satisfy $\max(\mathscr{E}_T,\mathscr{S}_T)=O(\min(\Gamma,\pi)^{-1}d^2\log^2 T)$.

Finite Action Spaces.

Finally, if we instead consider the commonly studied case of only having a finite number of possible actions (Abbasi-Yadkori et al., 2011; Dani et al., 2008; Agrawal and Devanur, 2016), then the issues of primal precision do not arise, since we do not need to exactly know the constraints in order to exactly locate any action. If we simply define $\Delta=\min_{\mathcal{X}}\max(\langle\theta,x^*-x\rangle,\max_i(\langle a^i,x\rangle-\alpha^i)_+)$, then merely employing the techniques described in §4 yields (see §E.3.3)

Proposition 7.7.

Over finite action spaces, with probability at least $1-\delta$, the actions of doss($\delta$) ensure that $\max(\mathscr{E}_T,\mathscr{S}_T)=O(\Delta^{-1}d^2\log^2 T)$.
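For a finite action set, $\Delta$ can be evaluated by direct enumeration, as in the following sketch. Taken literally over all of $\mathcal{X}$ the minimum would be attained at $x^*$ with value $0$, so the sketch assumes that the minimum is restricted to actions for which the inner maximum is positive; this restriction, the helper name, and the toy numbers are all assumptions made for illustration.

```python
import numpy as np

def finite_action_gap(theta, A, alpha, actions, tol=1e-12):
    """Smallest positive value of max(<theta, x* - x>, max_i (<a^i, x> - alpha^i)_+) over the actions."""
    theta = np.asarray(theta, dtype=float)
    A = np.asarray(A, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    X = np.asarray(actions, dtype=float)
    rewards = X @ theta
    violation = np.maximum((X @ A.T - alpha).max(axis=1), 0.0)
    best_reward = rewards[violation <= tol].max()  # reward of the best feasible action x*
    score = np.maximum(best_reward - rewards, violation)
    return score[score > tol].min()

theta = [1.0, 0.0]
A = [[1.0, 0.0]]
alpha = [0.5]
actions = [[0.5, 0.0], [0.25, 0.0], [0.75, 0.0], [0.6, 0.0]]
delta_gap = finite_action_gap(theta, A, alpha, actions)
print(delta_gap)
```

Here the action $(0.6, 0)$ has higher reward than $x^*=(0.5,0)$ but violates the constraint by about $0.1$, and this safety excess is what determines the gap.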

8 Simulations

We verify the theoretical study above with simulations over Example 6.3, and study the relative performance of doss and the pessimistic-optimistic method safe-LTS (Moradipari et al., 2021). These implementations are based on the following relaxation of Algorithm 1.

Computationally Feasible Relaxation.

A well-known barrier to implementing Algorithm 1 is that even if all constraints were known, the program (3) is non-convex (Dani et al., 2008). In our case, this is further complicated by the fact that the set $\widetilde{\mathcal{S}}_t$ needs to be determined. Following Dani et al. (2008), we approach these issues by constructing box confidence sets, i.e.,

$$\boldsymbol{\mathcal{C}}_{t,1}:=\{\tilde{A}:\forall i,\ \|(\tilde{a}^i-\hat{a}^i)V_t^{1/2}\|_1\leq\sqrt{d\beta_t}\}.$$

Since $\|\cdot\|_2\leq\|\cdot\|_1\leq\sqrt{d}\|\cdot\|_2$, $\boldsymbol{\mathcal{C}}_{t,1}\subset\boldsymbol{\mathcal{C}}_t$. Further, due to the same equivalence, the $\ell_2$-based analysis persists, up to a blowup of $\sqrt{d}$ in $\rho_t$, and thus running doss with $\boldsymbol{\mathcal{C}}_{t,1}$ worsens our bounds from $(d^2\log^2 T,\sqrt{d^2T})$ to $(d^3\log^2 T,\sqrt{d^3T})$.

The main advantage of $\boldsymbol{\mathcal{C}}_{t,1}$ lies in the fact that the box-confidence sets are polytopes. Due to this, the $\tilde{A}_t$ that are active for the optimistic action $x_t$ must lie at the extreme points of these sets. Since each set has only $2d$ extreme points, this allows us to determine $x_t$ by solving $(2d)^{U+1}$ convex programs, which is computationally feasible so long as $U$ is small. Of course, this complexity remains painfully slow as $U$ grows. Finding versions of doss that are computationally practical for a large number of unknown constraints remains an interesting open problem.
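The enumeration described above can be sketched as follows. The snippet lists the $2d$ extreme points of a single set $\{a:\|(a-\hat{a})V^{1/2}\|_1\leq r\}$ and then forms every combination across $U+1$ such sets, presumably one for the objective and one per unknown constraint. It assumes $V$ is symmetric positive definite; the matrix, radius, estimates, and function name are hypothetical, and the convex program in $x$ that each combination induces is not solved here.

```python
import itertools

import numpy as np

def l1_confidence_vertices(a_hat, V, radius):
    """Extreme points of {a : ||(a - a_hat) V^{1/2}||_1 <= radius} for positive definite V."""
    a_hat = np.asarray(a_hat, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(np.asarray(V, dtype=float))
    V_inv_half = (eigvecs / np.sqrt(eigvals)) @ eigvecs.T  # symmetric V^{-1/2}
    vertices = []
    for j in range(a_hat.size):
        for sign in (1.0, -1.0):
            # (a - a_hat) V^{1/2} = sign * radius * e_j, hence a = a_hat + sign * radius * e_j V^{-1/2}.
            vertices.append(a_hat + sign * radius * V_inv_half[j])
    return np.array(vertices)

d, U = 2, 2
V = np.array([[4.0, 1.0], [1.0, 3.0]])
radius = 0.5
# One estimate for the objective and one per unknown constraint (illustrative values).
estimates = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.5, 0.5])]
vertex_sets = [l1_confidence_vertices(a, V, radius) for a in estimates]
candidates = list(itertools.product(*vertex_sets))
print(len(candidates))  # (2d)^(U+1) = 4^3 = 64
```

The count grows exponentially in $U$, which is the source of the computational burden noted above.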

Setting.

We implement doss with the $L_\infty$ relaxation above on the instance of Example 6.3 over the horizon $T=10^4$, and with the parameters $\lambda=2$, $\delta=1/(4T)=2.5\times10^{-5}$. The noise in observations is independent and Gaussian, with variance $0.1$. Notice that for this instance, $\Gamma=\nicefrac{1}{8}$.

Behaviour of doss.

Our main observation is that doss is very effective, and has well-controlled violations. Figure 5 shows the efficacy regret $\mathscr{E}_t$ and both the arbitrary precision safety violations $\mathscr{S}_t$ and the finite precision safety violations $\mathscr{S}_t^{\varepsilon}$ for the value $\varepsilon=0.05=2\Gamma/5$. The simulations validate our main claims of strong efficacy regret control, and well-behaved growth of safety violations. Indeed, observe that the efficacy regret is essentially zero over most of the runs (with rare runs rising to $\mathscr{E}_{10^4}\approx100$). This property arises since doss very rarely plays suboptimal BISs (see the following discussion and Figure 6), and when it plays the optimal BIS, it plays an ‘over-efficient’ but unsafe point. Further, the extent of the lack of safety of the actions chosen by doss is well-controlled, as seen in the behaviour of $\mathscr{S}_T$. The finite precision regret shows even stronger control, with growth essentially halted at $t\approx5000$, validating the analysis underlying Theorem 7.3.

Figure 5: Efficacy Regret and Safety Violation of doss. We plot averages and one standard deviation confidence regions over 30 runs for $\mathscr{E}_T$ (left) and both $\mathscr{S}_t$ and $\mathscr{S}_t^{0.05}$ (right). We also plot the upper bounds we show in the latter to contextualise the observations. Observe that the efficacy regret is marginal: the mean is essentially $0$, and the variance limited. Further, observe that the growth of the net safety violation $\mathscr{S}_t$ is well-controlled, and lies far below the bounds of §7. Further, the finite precision violations show a strong flattening, as is expected from Theorem 7.3.

doss rarely activates suboptimal index sets.

In Figure 6, we plot the number of times that doss noisily activates a suboptimal BIS, i.e., any index set other than $I_2=\{1,3\}$. The main observation is that this occurs very rarely: indeed, over the horizon of $10^4$, most runs do not activate suboptimal BISs more than 100 times. This is far below the upper bound of Theorem 6.11.

Figure 6: Suboptimal BIS activation by doss in the instance of Example 6.3. Observe that such activation is very rare, typically far less than $1\%$ of the times, and the growth is essentially flat.

doss Compares Favourably with Pessimistic-Optimistic Methods.

To contextualise our method, we also implement the PO-method safe-LTS due to Moradipari et al. (2021) in the instance of Example 6.3. Instead of the optimistic permissible set $\widetilde{\mathcal{S}}$, safe-LTS constructs a pessimistic set $\Pi_t=\{x:\forall\tilde{A}\in\boldsymbol{\mathcal{C}}_t,\ \tilde{A}x\leq\alpha\}$. Note that with high probability, all points in $\Pi_t$ must be safe. The method then selects actions optimistically, in this case by exploiting Thompson sampling. Naturally, this method requires the knowledge of a safe point with margin to begin with, and we supply the point $x^{\mathsf{s}}=(0,0)$ to the method, which has the (large) margin $M^{\mathsf{s}}=\nicefrac{1}{2}$.
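The contrast between the pessimistic and optimistic sets can be illustrated with a membership test. The sketch assumes that $\boldsymbol{\mathcal{C}}_t$ is a product of ellipsoids $\{\tilde{a}^i:\|(\tilde{a}^i-\hat{a}^i)V_t^{1/2}\|_2\leq\sqrt{\beta_t}\}$ and that the optimistic permissible set consists of the points feasible for at least one $\tilde{A}\in\boldsymbol{\mathcal{C}}_t$; both are assumptions, since these definitions are given elsewhere in the paper, and all names and numbers are hypothetical.

```python
import numpy as np

def set_membership(A_hat, V, beta, alpha, x):
    """Return (x in pessimistic set, x in optimistic set) under ellipsoidal confidence sets."""
    A_hat = np.asarray(A_hat, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    x = np.asarray(x, dtype=float)
    # Over the ellipsoid, <a, x> ranges over <a_hat, x> +/- sqrt(beta * x^T V^{-1} x).
    width = np.sqrt(beta * (x @ np.linalg.solve(V, x)))
    centre = A_hat @ x
    pessimistic = bool(np.all(centre + width <= alpha))  # every plausible A certifies x
    optimistic = bool(np.all(centre - width <= alpha))   # some plausible A certifies x
    return pessimistic, optimistic

A_hat = [[1.0, 0.0], [0.0, 1.0]]
V = 4.0 * np.eye(2)
alpha = [0.5, 0.5]
for x in ([0.0, 0.0], [0.45, 0.0], [1.2, 0.0]):
    print(x, set_membership(A_hat, V, beta=1.0, alpha=alpha, x=x))
```

Under these assumptions the origin belongs to both sets, the point $(0.45, 0)$ is excluded by the pessimistic set although the estimated constraints deem it safe, and $(1.2, 0)$ is excluded by both. The pessimistic set is thus always contained in the optimistic one, which is consistent with the conservatism of safe-LTS discussed in this section.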

Figure 7: Comparing the behaviour of doss and safe-LTS on the instance of Example 6.3. The left plot shows the raw efficacy regret, while the right plot is the raw safety violations of the two methods, and each reports means and one-standard deviation confidence regions over 30 runs. Observe that the efficacy performance of safe-LTS is extremely poor, indicating that the algorithm is far from the boundary of the safe set $\mathcal{S}$ for most of its runs. In contrast, the violation properties of doss are well-controlled, and almost four times smaller than the efficacy regret of safe-LTS.

Figure 7 compares the behaviour of the raw efficacy regret $\sum\langle\theta,x^*-x_t\rangle$ (left) and the raw safety violation $\sum\max_i(\langle a^i,x_t\rangle-\alpha^i)$ (right) of safe-LTS and doss (since the efficacy regret of doss and the safety violations of safe-LTS are both essentially $0$, the raw behaviour offers more insight). As expected, safe-LTS suffers from $0$ safety regret, since it plays in a pessimistic set. However, this is accompanied by a large efficacy regret, with a mean of over $7000$ at the horizon $T=10^4$. This arises due to the extreme conservatism of this method, which is evident from its safety violation property: the method has a strong negative (and still decreasing) violation, indicating that it continues to play deep in the interior of the domain for large $T$. Indeed, since over the domain, $\langle a,x\rangle-\alpha\in[-0.5,0.5]$, and since the violation at $T=10^4$ is roughly $-3000$, this indicates that with a nontrivial probability, the method remains at least $0.25$-separated from the boundary of the safe set.

In comparison, observe that the raw efficacy regret of doss is negative, but not nearly as far as the violations of safe-LTS. This indicates that the method is shrinking towards the boundary of the safe set at a much better rate. Of course, this property is similarly illustrated by the violation behaviour: this is nearly four times smaller than the efficacy regret of safe-LTS, and concentrates strongly to $\approx800$ at $T=10^4$.

9 Discussion

The SLB problem is inherently challenging due to the roundwise enforcement of constraints. Our work offers new and refined insights into both the hardness of the problem, through our instance-dependent superlogarithmic lower bound, and the effectiveness of doubly-optimistic methods for the same, through our strong control on $\mathscr{E}_T$. In the process, we developed a new dual viewpoint of the SLB problem, by developing gaps for sets of constraints, which we believe is a conceptually important tool for such problems. Of course, a number of interesting questions remain open, e.g., whether there are computationally efficient ways to implement doubly-optimistic strategies for large $U$, or whether one can design methods that attain the strong safety guarantees of PO methods without making the strong assumptions of prior knowledge of safe points. We believe that tackling these challenges is key to the effective use of bandit feedback in practical scenarios.

Acknowledgements.

We acknowledge support by the Air Force Research Laboratory grant FA8650-22-C1039, Army Research Office grant W911NF2110246, and the National Science Foundation grants CCF-2007350 and CCF-1955981.

References

  • Abbasi-Yadkori et al. (2011) Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24:2312–2320, 2011.
  • Afsharrad et al. (2023) Amirhossein Afsharrad, Ahmadreza Moradipari, and Sanjay Lall. Convex methods for constrained linear bandits. arXiv preprint arXiv:2311.04338, 2023.
  • Agrawal and Devanur (2016) Shipra Agrawal and Nikhil Devanur. Linear contextual bandits with knapsacks. Advances in Neural Information Processing Systems, 29:3450–3458, 2016.
  • Agrawal and Devanur (2014) Shipra Agrawal and Nikhil R Devanur. Bandits with concave rewards and convex knapsacks. In Proceedings of the fifteenth ACM conference on Economics and computation, pages 989–1006, 2014.
  • Agrawal et al. (2016) Shipra Agrawal, Nikhil R Devanur, and Lihong Li. An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives. In Conference on Learning Theory, pages 4–18. PMLR, 2016.
  • Amani et al. (2019) Sanae Amani, Mahnoosh Alizadeh, and Christos Thrampoulidis. Linear stochastic bandits under safety constraints. arXiv preprint arXiv:1908.05814, 2019.
  • Badanidiyuru et al. (2013) Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knapsacks. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 207–216. IEEE, 2013.
  • Badanidiyuru et al. (2014) Ashwinkumar Badanidiyuru, John Langford, and Aleksandrs Slivkins. Resourceful contextual bandits. In Conference on Learning Theory, pages 1109–1134. PMLR, 2014.
  • Bernasconi et al. (2022) Martino Bernasconi, Federico Cacciamani, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti, and Francesco Trovò. Safe learning in tree-form sequential decision making: Handling hard and soft constraints. In International Conference on Machine Learning, pages 1854–1873. PMLR, 2022.
  • Bertsimas and Tsitsiklis (1997) Dimitris Bertsimas and John N Tsitsiklis. Introduction to linear optimization, volume 6. Athena scientific Belmont, MA, 1997.
  • Camilleri et al. (2022) Romain Camilleri, Andrew Wagenmaker, Jamie Morgenstern, Lalit Jain, and Kevin Jamieson. Active learning with safety constraints. arXiv preprint arXiv:2206.11183, 2022.
  • Carlsson et al. (2023) Emil Carlsson, Debabrota Basu, Fredrik D Johansson, and Devdatt Dubhashi. Pure exploration in bandits with linear constraints. arXiv preprint arXiv:2306.12774, 2023.
  • Chen et al. (2022) Tianrui Chen, Aditya Gangrade, and Venkatesh Saligrama. Strategies for safe multi-armed bandits with logarithmic regret and risk. In Proceedings of the 39th International Conference on Machine Learning, pages 3123–3148, 2022.
  • Dani et al. (2008) Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory, 2008.
  • Efroni et al. (2020) Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-exploitation in constrained mdps. arXiv preprint arXiv:2003.02189, 2020.
  • Gales et al. (2022) Spencer B Gales, Sunder Sethuraman, and Kwang-Sung Jun. Norm-agnostic linear bandits. In International Conference on Artificial Intelligence and Statistics, pages 73–91. PMLR, 2022.
  • Hutchinson et al. (2023) Spencer Hutchinson, Berkay Turan, and Mahnoosh Alizadeh. The impact of the geometric properties of the constraint set in safe optimization with bandit feedback. In Learning for Dynamics and Control Conference, pages 497–508. PMLR, 2023.
  • Katz-Samuels and Scott (2019) Julian Katz-Samuels and Clayton Scott. Top feasible arm identification. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1593–1601. PMLR, 2019.
  • Lattimore and Szepesvári (2020) Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
  • Liu et al. (2021) Xin Liu, Bin Li, Pengyi Shi, and Lei Ying. An efficient pessimistic-optimistic algorithm for stochastic linear bandits with general constraints. Advances in Neural Information Processing Systems, 34:24075–24086, 2021.
  • Moradipari et al. (2021) Ahmadreza Moradipari, Sanae Amani, Mahnoosh Alizadeh, and Christos Thrampoulidis. Safe linear thompson sampling with side information. IEEE Transactions on Signal Processing, 2021.
  • Pacchiano et al. (2021) Aldo Pacchiano, Mohammad Ghavamzadeh, Peter Bartlett, and Heinrich Jiang. Stochastic bandits with linear constraints. In International Conference on Artificial Intelligence and Statistics, pages 2827–2835. PMLR, 2021.
  • Pacchiano et al. (2024) Aldo Pacchiano, Mohammad Ghavamzadeh, and Peter Bartlett. Contextual bandits with stage-wise constraints. arXiv preprint arXiv:2401.08016, 2024.
  • Radhakrishnan and Tidor (2008) Mala L Radhakrishnan and Bruce Tidor. Optimal drug cocktail design: methods for targeting molecular ensembles and insights from theoretical model systems. Journal of chemical information and modeling, 48(5):1055–1073, 2008.
  • Shamir (2015) Ohad Shamir. On the complexity of bandit linear optimization. In Conference on Learning Theory, pages 1523–1551. PMLR, 2015.
  • Shao et al. (2018) Han Shao, Xiaotian Yu, Irwin King, and Michael R Lyu. Almost optimal algorithms for linear stochastic bandits with heavy-tailed payoffs. Advances in Neural Information Processing Systems, 31, 2018.
  • Turchetta et al. (2016) Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. Safe exploration in finite markov decision processes with gaussian processes. Advances in Neural Information Processing Systems, 29, 2016.
  • Varma et al. (2023) K Nithin Varma, Sahin Lale, and Anima Anandkumar. Stochastic linear bandits with unknown safety constraints and local feedback. In ICML Workshop on New Frontiers in Learning, Control, and Dynamical Systems, 2023.
  • Vaswani et al. (2022) Sharan Vaswani, Lin Yang, and Csaba Szepesvari. Near-optimal sample complexity bounds for constrained MDPs. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
  • Wachi and Sui (2020) Akifumi Wachi and Yanan Sui. Safe reinforcement learning in constrained markov decision processes. In International Conference on Machine Learning, pages 9797–9806. PMLR, 2020.
  • Wang et al. (2022) Zhenlin Wang, Andrew J Wagenmaker, and Kevin Jamieson. Best arm identification with safety constraints. In International Conference on Artificial Intelligence and Statistics, pages 9114–9146. PMLR, 2022.
  • Wu et al. (2016) Yifan Wu, Roshan Shariff, Tor Lattimore, and Csaba Szepesvári. Conservative bandits. In International Conference on Machine Learning, pages 1254–1262. PMLR, 2016.

Appendix A Related Work on Pure Exploration.

While we study the regret formulation, work on constrained bandits has naturally also appeared in the pure exploration setting. The typical such paper aims to recover arms that are both nearly-safe and nearly-optimal, in a PAC sense. Katz-Samuels and Scott (2019) study this question for finite-armed bandits, and Wang et al. (2022) extend this study to a structured multi-armed bandit setting where each arm has a continuous parameter that must be selected, and which monotonically affects the reward and safety of the arm. Most pertinently, Camilleri et al. (2022); Carlsson et al. (2023) study best feasible arm identification in the linear bandit setting with the same structure as ours, although they assume that the set of possible actions is finite and known a priori. It is interesting to note that even in the identification setting, where safety is not enforced during learning, methods that can identify good arms quickly can only give guarantees of safety up to a given precision. This complements our observations in the regret setting.

Appendix B On the Assumptions, and Background on Online Linear Regression

We give an expanded discussion of the standard assumptions made in §2, and discuss a standard result from online linear regression controlling $\sum \|x_t\|_{V_{t-1}^{-1}}$ that is key to our analysis.

B.1 A closer look at the assumptions

The assumptions made in the main text are slightly simplified versions of standard assumptions from the literature on linear bandits.

Boundedness. The boundedness assumption has two parts: firstly, that the underlying parameters are bounded, i.e., $\|\theta\|, \|a^i\| \le 1$, and secondly, that the domain is bounded, i.e., $\|x\| \le 1$ for all $x \in \mathcal{X} = \{Bx \le \beta\}$.

The bounded domain assumption is used chiefly to ensure that the underlying optimisation problem of interest has finite value. Quantitatively, this may be replaced with a generic bound $\|x\| \le L$ instead without appreciably changing the study. The principal way this affects doss is via the choice of the regulariser: instead of setting $V_t = I + \sum x_s x_s^\top$, this requires us to set $V_t = \lambda I + \sum x_s x_s^\top$ for some $\lambda \ge L^2$. Concretely, the validity of the appropriate modification of Lemma 3.2 to handle general regularisation requires using

$$\sqrt{\omega_t(\delta;\lambda)} = \sqrt{\frac{1}{2}\log\left(\frac{(U+1)\det(V_t)^{1/2}\det(\lambda I)^{-1/2}}{\delta}\right)} + \lambda^{1/2}$$

for a $\lambda$ such that $\lambda \ge \max_t \|x_t\|^2$, which may be ensured by setting $\lambda \ge L^2$. The main paper avoids this notational clutter by just setting $\lambda = 1$ and assuming $\|x\| \le 1$. A second aspect that is affected by the quantity $L$ is that the upper bound of Lemma B.1 would read $\log(1 + TL^2/\lambda d)$ instead of $\log(1 + T/d)$, which mildly affects some logarithmic terms in the regret bounds (and in fact no bound reported in the main text needs modification if we set $\lambda \ge L^2$ and assume that $L \le d$).

The assumption of bounded parameters is largely without loss of generality - indeed, if we had a bound $\|\theta\|, \max_i \|a^i\| \le S$ instead, the only change required is that the confidence set radius $\omega_t$ would need to be set as

$$\sqrt{\omega_t(\delta;\lambda,S)} = \sqrt{\omega_t(\delta;\lambda)} + (S-1)\sqrt{\lambda},$$

i.e., only the additive $\sqrt{\lambda}$ term in $\sqrt{\omega_t}$ above would need adjustment. We note that in general, the norm bounds on the various $a^i$ and $\theta$ need not agree, and it is in fact possible to adapt to their norms without prior knowledge of the same, by setting distinct $\omega_t^i$s for each $a^i$, and using the techniques of the recent work of Gales et al. (2022).

SubGaussianity. While the subGaussianity condition can also be relaxed (for instance, linear bandits with heavy tailed noise have been studied (Shao et al., 2018)), it yields significant technical convenience whilst remaining quite a generic setting. In the assumption, we concretely assume that the noise is conditionally $1$-subGaussian. This may be relaxed to conditionally $R$-subGaussian. This too can be handled with a small change in $\omega_t$ to

$$\sqrt{\omega_t(\delta;\lambda,R)} = R\sqrt{\frac{1}{2}\log\left(\frac{(U+1)\det(V_t)^{1/2}\det(\lambda I)^{-1/2}}{\delta}\right)} + \lambda^{1/2}.$$

This change is somewhat stronger than the corresponding change induced by altering $\|\theta\|$ and $\|a^i\|$, since the scaling is now applied to the first term of $\omega_t$, which grows with $t$, unlike the constant $\sqrt{\lambda}$ penalty.

Overall Confidence Radius with General Parameters. To sum up, under the generic conditions $\|x\| \le L$, $\|\theta\| \le S$, $\|a^i\| \le S$, and $R$-subGaussianity of $\{\gamma_t^i\}$, the entirety of our following analysis will go through, but with the blown up confidence radii

$$\sqrt{\omega_t(\delta;\lambda,L,S,R)} = R\sqrt{\frac{1}{2}\log\left(\frac{(U+1)\det(V_t)^{1/2}\det(\lambda I)^{-1/2}}{\delta}\right)} + S\lambda^{1/2},$$

and under the condition $\lambda \ge L^2$. This results in an increase of the regret bounds by a factor of roughly at most $\max(R,S)$, along with a potential increase in the logarithmic terms to $\log(1 + TL^2/\delta)$ instead of $\log(1 + T/\delta)$. For the remainder of our analysis, we shall stick to the default parameters $R = S = L = \lambda = 1$.
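As a rough numerical companion to the general-parameter radius above, the following Python sketch evaluates $\sqrt{\omega_t(\delta;\lambda,L,S,R)}$ for a given regularised Gram matrix. The function names `logdet_spd` and `confidence_radius`, the list-of-lists matrix representation, and the explicit check of $\lambda \ge L^2$ are illustrative assumptions, not part of doss as specified in the paper.

```python
import math

def logdet_spd(V):
    """Log-determinant of a symmetric positive definite matrix via Cholesky."""
    d = len(V)
    chol = [[0.0] * d for _ in range(d)]
    total = 0.0
    for i in range(d):
        for j in range(i + 1):
            s = V[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                chol[i][i] = math.sqrt(s)
                total += math.log(s)  # log det V = sum_i log(chol_ii^2)
            else:
                chol[i][j] = s / chol[j][j]
    return total

def confidence_radius(V_t, U, delta, lam=1.0, L=1.0, S=1.0, R=1.0):
    """sqrt(omega_t(delta; lam, L, S, R)) for V_t = lam*I + sum_s x_s x_s^T.

    U is the number of unknown constraints, delta the failure probability,
    L the domain bound, S the parameter-norm bound, R the subGaussian scale.
    """
    if lam < L ** 2:
        raise ValueError("the text requires lam >= L**2")
    d = len(V_t)
    # log( det(V_t) / det(lam * I) ), using det(lam * I) = lam^d
    log_det_ratio = logdet_spd(V_t) - d * math.log(lam)
    # log( (U + 1) det(V_t)^{1/2} det(lam I)^{-1/2} / delta )
    log_term = math.log((U + 1) / delta) + 0.5 * log_det_ratio
    return R * math.sqrt(0.5 * log_term) + S * math.sqrt(lam)
```

With the default parameters $R = S = L = \lambda = 1$ and $V_t = I$, the determinant ratio is one, so the radius reduces to $\sqrt{\tfrac{1}{2}\log((U+1)/\delta)} + 1$.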

B.2 Quantitative Bounds from the Theory of Online Linear Regression

We conclude the preliminaries with the following generic statement, which holds due to a couple of applications of the matrix-determinant lemma. The result is standard - see Abbasi-Yadkori et al. (2011, Lemma 11) for a historical discussion.

Lemma B.1.

Let $\{x_t\}$ be the actions of doss. Suppose that for all $t$, $\|x_t\| \le 1$, and let $\lambda \ge 1$. Then for any $T$,

$$\sum_{t=1}^{T} \|x_t\|_{V_{t-1}^{-1}}^2 \le \frac{3}{2}\log\left(\frac{\det(V_T)}{\det(\lambda I)}\right) \le \frac{3}{2} d \log\left(1 + \frac{T}{\lambda d}\right).$$

Proof of Lemma B.1.

First notice that since $V_t = V_{t-1} + x_t x_t^\top$, by the matrix-determinant lemma,

$$\det(V_t) = \det(V_{t-1}) \det\left(I + V_{t-1}^{-1/2} x_t x_t^\top (V_{t-1}^{-1/2})^\top\right) = \det(V_{t-1})\left(1 + \|x_t\|_{V_{t-1}^{-1}}^2\right),$$

and induction yields

$$\det(V_T) = \det(\lambda I) \prod_{t=1}^{T} \left(1 + \|x_t\|_{V_{t-1}^{-1}}^2\right),$$

where we have used that $V_0 = \lambda I$.

Now, notice that since $V_{t-1} \succeq \lambda I$ for each $t$, it follows that $\|x_t\|_{V_{t-1}^{-1}}^2 \le \|x_t\|^2/\lambda \le 1$. But for $z \in [0,1]$, $z \le \frac{3}{2}\log(1+z)$, which implies that

$$\sum_{t=1}^{T} \|x_t\|_{V_{t-1}^{-1}}^2 \le \frac{3}{2} \sum_{t=1}^{T} \log\left(1 + \|x_t\|_{V_{t-1}^{-1}}^2\right) = \frac{3}{2}\log\frac{\det(V_T)}{\det(\lambda I)}.$$

Finally, note that since $V_T$ is positive definite, by an application of the AM-GM inequality, $\det(V_T) \le (\mathrm{trace}(V_T)/d)^d$, and further, $\mathrm{trace}(V_T) = d\lambda + \sum_t \|x_t\|_2^2 \le d\lambda + T$. Further observing that $\det(\lambda I) = \lambda^d$, we conclude that

$$\log\frac{\det(V_T)}{\det(\lambda I)} \le d \log\frac{(d\lambda + T)/d}{\lambda} = d \log\left(1 + \frac{T}{d\lambda}\right).$$
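The chain of inequalities in Lemma B.1 can be sanity-checked numerically. The sketch below is only an illustration under assumptions that are not from the paper: it fixes $d = 2$ so that the inverse and determinant of $V_t$ have closed forms, draws unit-norm actions uniformly at random (rather than from doss), and uses the hypothetical function name `check_lemma_b1`.

```python
import math
import random

def check_lemma_b1(T=500, lam=1.0, seed=0):
    """Return (potential, middle, upper) for the two inequalities of Lemma B.1.

    potential = sum_t ||x_t||^2_{V_{t-1}^{-1}}
    middle    = (3/2) * log(det(V_T) / det(lam * I))
    upper     = (3/2) * d * log(1 + T / (lam * d))
    """
    rng = random.Random(seed)
    d = 2
    a, b, c = lam, 0.0, lam  # V_0 = lam * I, stored as [[a, b], [b, c]]
    potential = 0.0
    for _ in range(T):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        x0, x1 = math.cos(angle), math.sin(angle)  # random action with ||x_t|| = 1
        det_prev = a * c - b * b
        # x^T V_{t-1}^{-1} x via the closed-form inverse of a 2x2 matrix
        potential += (c * x0 * x0 - 2.0 * b * x0 * x1 + a * x1 * x1) / det_prev
        # rank-one update V_t = V_{t-1} + x_t x_t^T
        a, b, c = a + x0 * x0, b + x0 * x1, c + x1 * x1
    log_det_ratio = math.log(a * c - b * b) - d * math.log(lam)
    middle = 1.5 * log_det_ratio
    upper = 1.5 * d * math.log(1.0 + T / (lam * d))
    return potential, middle, upper
```

Calling `check_lemma_b1()` returns three numbers that should be in nondecreasing order. Since $z \ge \log(1+z)$, the first should also be at least two thirds of the second, which reflects the telescoping identity for $\det(V_T)$ used in the proof.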

An immediate consequence of the above is the following pair of observations which we shall use frequently.

Lemma B.2.

Let $\{x_t\}$ be the actions of doss run with the parameters $\lambda, \delta$. For every $T > 0$,

$$\sum_{t \le T} \rho_t(x_t;\delta)^2 \le 3d^2 \log^2\left(1 + \frac{T}{\lambda d}\right) + 6d \log\left(1 + \frac{T}{\lambda d}\right)\left(\log\frac{U+1}{\delta} + 2\lambda\right), \tag{4}$$
$$\sum_{t \le T} \rho_t(x_t;\delta) \le d\sqrt{3T} \log\left(1 + \frac{T}{d\lambda}\right) + \sqrt{6dT \log\left(1 + \frac{T}{\lambda d}\right)}\left(\sqrt{2\lambda} + \sqrt{\log\frac{U+1}{\delta}}\right). \tag{5}$$

These bounds supply the core bounds needed to convert the control we develop on $\rho_t$ in §6.3 and §7 into control on $\mathscr{E}_T$ and $\mathscr{S}_T$. Observe that the main terms in the above results do not show dependence on the failure probability parameter $\delta$.

Proof of Lemma B.2.

Recall that $\rho_t(x_t;\delta) = 2\sqrt{\omega_t(\delta)} \cdot \|x_t\|_{V_{t-1}^{-1}}$. Further observe that $\omega_t$ is an increasing function of $t$. Immediately by Lemma B.1,

$$\sum \rho_t^2 \le 4\omega_T(\delta) \sum \|x_t\|_{V_{t-1}^{-1}}^2 \le 6d\,\omega_T(\delta) \log\left(1 + \frac{T}{d\lambda}\right).$$

Further, once again applying Lemma B.1, and noting that $(\sqrt{u} + \sqrt{v})^2 \le 2u + 2v$,

$$\begin{aligned} \sqrt{\omega_T(\delta)} &= \sqrt{\lambda} + \sqrt{\frac{1}{2}\log\frac{U+1}{\delta} + \frac{1}{4}\log\frac{\det(V_T)}{\det(\lambda I)}} \\ \implies \omega_T(\delta) &\le 2\lambda + \log\frac{U+1}{\delta} + \frac{d}{2}\log\left(1 + \frac{T}{\lambda d}\right). \end{aligned}$$

Multiplying these two bounds controls $\sum \rho_t^2$.

Further, by the Cauchy-Schwarz inequality,

t=1TρtTt=1Tρt2.superscriptsubscript𝑡1𝑇subscript𝜌𝑡𝑇superscriptsubscript𝑡1𝑇superscriptsubscript𝜌𝑡2\sum_{t=1}^{T}\rho_{t}\leq\sqrt{T}\cdot\sqrt{\sum_{t=1}^{T}\rho_{t}^{2}}.∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_ρ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ≤ square-root start_ARG italic_T end_ARG ⋅ square-root start_ARG ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_ρ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG .

The bound (5) follows upon applying the bound on $\sum \rho_t^2$ above, and then using the trivial relation $\sqrt{u+v} \le \sqrt{u} + \sqrt{v}$. ∎
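As a numerical sanity check of the ingredients used in this proof, the following sketch simulates a short action sequence in $d = 2$ and tests three inequalities: the standard elliptical potential bound $\sum_t \|x_t\|_{V_{t-1}^{-1}}^2 \le 2\log\frac{\det(V_T)}{\det(\lambda I)}$, the determinant bound $\log\frac{\det(V_T)}{\det(\lambda I)} \le d\log(1 + \frac{T}{\lambda d})$, and the Cauchy-Schwarz step. That the elliptical potential bound is one of the "two bounds" being multiplied is an assumption, since that display lies outside this passage. The sketch also assumes $\|x_t\| \le 1$ and $\lambda \ge 1$; the action distribution, seed, and variable names are purely illustrative.

```python
import math
import random

random.seed(1)
d, T, lam = 2, 500, 1.0

# V is kept as a symmetric 2x2 matrix [[a, b], [b, c]], starting at lam * I.
a, b, c = lam, 0.0, lam
sq_norms = []  # ||x_t||^2 measured in the V_{t-1}^{-1} norm
for _ in range(T):
    # Hypothetical action drawn from the unit disc, so ||x_t|| <= 1.
    ang = random.uniform(0.0, 2.0 * math.pi)
    rad = math.sqrt(random.random())
    x1, x2 = rad * math.cos(ang), rad * math.sin(ang)
    det = a * c - b * b
    # x^T V^{-1} x via the closed-form inverse of a 2x2 matrix.
    sq_norms.append((c * x1 * x1 - 2.0 * b * x1 * x2 + a * x2 * x2) / det)
    # Rank-one update V <- V + x x^T.
    a, b, c = a + x1 * x1, b + x1 * x2, c + x2 * x2

log_det_ratio = math.log(a * c - b * b) - d * math.log(lam)
total_sq = sum(sq_norms)
total = sum(math.sqrt(s) for s in sq_norms)

potential_ok = total_sq <= 2.0 * log_det_ratio                 # elliptical potential
det_ok = log_det_ratio <= d * math.log(1.0 + T / (lam * d))    # determinant bound
cs_ok = total <= math.sqrt(T * total_sq)                       # Cauchy-Schwarz
print(potential_ok, det_ok, cs_ok)
```

Multiplying `sq_norms` by $4\omega_T(\delta)$ would give the quantities $\rho_t^2$ of the proof, up to replacing $\omega_t$ by its largest value $\omega_T$; the sketch omits this factor because it does not affect the three checks.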

Finally, let us argue that the quantity $\rho_t(x_t;\delta)$ indeed controls the noise scale of the problem by showing Lemma 3.2.

Proof of Lemma 3.2.

We refer to the proof of Abbasi-Yadkori et al. (2011, Thm. 2) for the consistency, and observe only that the factor $(U+1)/\delta$ enters our confidence radius $\sqrt{\omega_t(\delta)}$ upon applying a union bound to their analysis, which ensures concentration over the unknown objective and over the $U$ unknown constraints simultaneously. Of course, the factor of $\mathbf{1}_U$ arises in the definition of $\boldsymbol{\mathcal{C}}$ since each known constraint is already 'estimated' exactly by setting $\hat{a}^i_t = a^i$ for $i \in [U+1 : U+K]$.

To show that the noise scale limits the deviations $\langle \tilde{\theta} - \theta, x \rangle$, observe that under the assumption of consistency, $\theta \in \mathcal{C}_t^{\theta}$. Therefore

$$|\langle \tilde{\theta} - \theta, x \rangle| \le |\langle \tilde{\theta} - \hat{\theta}, x \rangle| + |\langle \theta - \hat{\theta}, x \rangle|.$$

By exploiting the positive definiteness of $V_{t-1}$ and the Cauchy-Schwarz inequality, we can further observe that

$$|\langle \theta - \hat{\theta}, x \rangle| = |\langle V_{t-1}^{1/2}(\theta - \hat{\theta}), V_{t-1}^{-1/2} x \rangle| \le \|\theta - \hat{\theta}\|_{V_{t-1}} \cdot \|x\|_{V_{t-1}^{-1}}.$$

Running the same calculation for $\tilde{\theta}$ and adding the bounds, we conclude that

$$|\langle \tilde{\theta} - \theta, x \rangle| \le \left(\|\theta - \hat{\theta}\|_{V_{t-1}} + \|\tilde{\theta} - \hat{\theta}\|_{V_{t-1}}\right) \|x\|_{V_{t-1}^{-1}}.$$

But both $\theta, \tilde{\theta} \in \mathcal{C}_t^{\theta}$, which by definition means that their $V_{t-1}$-norm distance from $\hat{\theta}$ is bounded by $\sqrt{\omega_t(\delta)}$. The claim is immediate upon recalling that $\rho_t(x;\delta) := 2\sqrt{\omega_t(\delta)}\,\|x\|_{V_{t-1}^{-1}}$.

Of course, the same argument applies to every $\tilde{a}^i$, and thus to $\tilde{A}$. Again, for known constraints, the radius of the confidence set is $0$, so $\tilde{a}^i = \hat{a}^i = a^i$, and hence the factor of $\mathbf{1}_U$ in $|(\tilde{A} - A)x| \le \rho_t(x;\delta)\mathbf{1}_U$.

Finally, the bounds on $\sum \rho_t(x_t;\delta)$ and $\sum \rho_t(x_t;\delta)^2$ follow directly from Lemma B.2. ∎
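The deviation bound of Lemma 3.2 can be checked numerically. The sketch below works in $d = 2$ with a hypothetical positive definite matrix standing in for $V_{t-1}$, an arbitrary centre $\hat{\theta}$, and an arbitrary radius standing in for $\sqrt{\omega_t(\delta)}$ (none of these values come from the paper). It samples pairs $\theta, \tilde{\theta}$ from the ellipsoid $\{\phi : \|\phi - \hat{\theta}\|_{V} \le \text{radius}\}$ and records the largest observed ratio $|\langle \tilde{\theta} - \theta, x \rangle| / \rho(x)$, which the argument above says should never exceed $1$.

```python
import math
import random

random.seed(0)

# Hypothetical 2x2 design matrix V = [[a, b], [b, c]] (positive definite).
a, b, c = 5.0, 1.5, 2.0
det = a * c - b * b
theta_hat = (0.3, -0.8)
radius = 0.7  # stands in for sqrt(omega_t(delta)); the value is arbitrary

# Cholesky factor V = L L^T, with L lower triangular.
l11 = math.sqrt(a)
l21 = b / l11
l22 = math.sqrt(c - l21 * l21)

def sample_in_ellipsoid():
    """Return a point p with ||p - theta_hat||_V <= radius."""
    ang = random.uniform(0.0, 2.0 * math.pi)
    rad = math.sqrt(random.random())
    u1, u2 = rad * math.cos(ang), rad * math.sin(ang)
    # Solve L^T w = u, so that w^T V w = ||u||^2 <= 1.
    w2 = u2 / l22
    w1 = (u1 - l21 * w2) / l11
    return (theta_hat[0] + radius * w1, theta_hat[1] + radius * w2)

def rho(x):
    """2 * radius * ||x||_{V^{-1}}, using the closed-form 2x2 inverse."""
    quad = (c * x[0] ** 2 - 2.0 * b * x[0] * x[1] + a * x[1] ** 2) / det
    return 2.0 * radius * math.sqrt(quad)

worst_ratio = 0.0
for _ in range(5000):
    th, th_tilde = sample_in_ellipsoid(), sample_in_ellipsoid()
    x = (random.gauss(0.0, 1.0), random.gauss(0.0, 1.0))
    dev = abs((th_tilde[0] - th[0]) * x[0] + (th_tilde[1] - th[1]) * x[1])
    worst_ratio = max(worst_ratio, dev / rho(x))

print(worst_ratio)  # should not exceed 1, up to floating-point error
```

The ratio approaches $1$ only when $\theta$ and $\tilde{\theta}$ sit at opposite ends of the ellipsoid along the direction $V^{-1}x$, which is why the factor $2$ in the definition of $\rho_t$ cannot be improved by this argument.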

Appendix C Appendix on the Structural Behaviour of doss

This section is devoted to showing the key structural properties of the behaviour of doss that we discussed in §6. In particular, we show the main result of §6.1, namely that any point that doss plays must noisily activate some BIS. To this end, we first characterise the behaviour of doss relative to polytopes contained in the permissible set. Before stating this characterisation, recall that an extreme point of a polytope (and indeed of any closed convex set) is a point that does not lie in the interior of a line segment joining two other points of the set. Further, each extreme point of a polytope in $\mathbb{R}^d$ must satisfy at least $d$ constraints with equality. For a polytope $\mathcal{P}$, we will denote its set of extreme points by $\mathcal{E}_{\mathcal{P}}$.

Lemma C.1.

Suppose that $\mathcal{P}$ is a polytope such that $\mathcal{P} \subset \widetilde{\mathcal{S}}_t$. If doss plays in $\mathcal{P}$, then $x_t$ must be an extreme point of $\mathcal{P}$, i.e., $x_t \in \mathcal{P} \implies x_t \in \mathcal{E}_{\mathcal{P}}$.

Let us first argue that Proposition 6.5 follows from the above Lemma.

Proof of Proposition 6.5.

For a choice of $\tilde{A} \in \boldsymbol{\mathcal{C}}_t$, define the polytope

$$\mathcal{P}(\tilde{A}) = \{x : \tilde{A} x \le \alpha\}.$$

Now, observe that

$$\widetilde{\mathcal{S}}_t = \bigcup_{\tilde{A} \in \boldsymbol{\mathcal{C}}_t} \{x : \tilde{A} x \le \alpha\} = \bigcup_{\tilde{A} \in \boldsymbol{\mathcal{C}}_t} \mathcal{P}(\tilde{A}),$$

i.e., $\widetilde{\mathcal{S}}_t$ can be decomposed as a union of polytopes. But then the selected point $x_t$ must lie in one of these polytopes, say $\mathcal{P}^*$.

Now, we have that $\mathcal{P}^* \subset \widetilde{\mathcal{S}}_t$ and $x_t \in \mathcal{P}^*$, and so by Lemma C.1, $x_t$ must be an extreme point of $\mathcal{P}^*$. But this implies that there are at least $d$ linearly independent constraints amongst $\tilde{A} x \le \alpha$ that $x_t$ activates, i.e., that there exists some $I \subset [1:m]$ such that $|I| = d$ and $\tilde{A}(I) x_t = \alpha(I)$. By definition, then, $x_t \in \widetilde{X}^I_t$, showing the claim. ∎

It remains to show the preceding Lemma. Before proceeding, let us comment that the statement above is intuitively obvious, but appears to be somewhat cumbersome to prove (as the argument below suggests, although nothing says that a cleaner proof could not be found). Of course, this statement extends also to the OFUL algorithm, and to our knowledge this has not been directly argued previously: instead, when working on polytopal domains, typically it is directly stated that it suffices to play on the extreme points of the polytope.

Proof of Lemma C.1.

Suppose that $x_t \in \mathcal{P}$. Then, due to the optimistic choice, there also exists some $\tilde{\theta}_t \in \mathcal{C}_t^0$ such that

$$(\tilde{\theta}_t, x_t) \in \operatorname*{arg\,max}_{\tilde{\theta} \in \mathcal{C}_t^0,\ x \in \mathcal{P}} \langle \tilde{\theta}, x \rangle.$$

Notice also that $x_t$ is a solution of the linear program $\max_{x \in \mathcal{P}} \langle \tilde{\theta}_t, x \rangle$, and so lies on the boundary of $\mathcal{P}$. Similarly, $\tilde{\theta}_t$ lies on the boundary of $\mathcal{C}_t^0$. We need to argue that $x_t$ must in fact be an extreme point of $\mathcal{P}$, i.e., it does not lie in the interior of some face of dimension $\ge 1$ of $\mathcal{P}$.

For this, first suppose for the sake of contradiction that $x_t$ lies in the interior of some 1-dimensional face of $\mathcal{P}$, say $\mathcal{F}$. Let $u$ be the direction of variation of $\mathcal{F}$. Then it must hold that $\langle \tilde{\theta}_t, u \rangle = 0$, else $\langle \tilde{\theta}_t, x_t + \varepsilon u \rangle$ would exceed $\langle \tilde{\theta}_t, x_t \rangle$ for some small $\varepsilon$ of appropriate sign. Now, let us rotate the domain so that $u$ is directed along one coordinate axis, and project onto the 2D subspace spanned by the (orthogonal) directions $u$ and $\tilde{\theta}_t$. Next, rescale the vectors so that both $u$ and $\tilde{\theta}_t$ have norm $1$, and finally translate the polytope so that the component of $x_t$ along $u$ is $0$. Notice that the projection of an ellipsoid is an ellipsoid, and so applying the same transformations to $\mathcal{C}_{t-1}^{\theta}$ produces a 2-dimensional convex confidence ellipsoid $D$.

Let us relabel the axes of the resulting system as $u_1$ and $u_2$. In the resulting coordinate system, $\tilde{\theta}_t = (0,1)$, and $\mathcal{F}$ is a line segment of the form $\{u_1 \in [p,q],\ u_2 = r\}$, where $p < 0 < q$, $r = \langle \tilde{\theta}_t, x_t \rangle / \|\tilde{\theta}_t\|$, and $x_t = (0, r)$. Observe that $\tilde{\theta}_t$ must lie on the boundary of $D$. We shall argue that there is some other $z \in \mathcal{F}$ and some other $\phi \in D$ such that $\langle z, \phi \rangle > r$, which contradicts the joint optimality of $(\tilde{\theta}_t, x_t)$.

We first take the case of $r > 0$. Observe that if any point of $D$ has $u_2$ coordinate greater than $1$, then we immediately have a contradiction, since for such a point $\phi$, $\langle \phi, x_t \rangle > \langle \tilde{\theta}_t, x_t \rangle$. But, since $\tilde{\theta}_t = (0,1) \in D$, it follows that the ellipse $D$ is tangent to the line $u_2 = 1$. This means that for small $\varepsilon$, $D$ must contain points $\phi_\varepsilon = (\varepsilon, 1 - f(\varepsilon))$ where $0 \le f(\varepsilon) = O(\varepsilon^2)$. But this implies a contradiction: indeed, take $\varepsilon > 0$, and consider $z_\varepsilon = (\varepsilon^{1/2}, r)$. Then $z_\varepsilon \in \mathcal{F}$ for small enough $\varepsilon$, and $\langle z_\varepsilon, \phi_\varepsilon \rangle - r = \varepsilon^{3/2} - r f(\varepsilon)$. Since $f(\varepsilon) = O(\varepsilon^2)$, this is positive for small enough $\varepsilon$, demonstrating a contradiction.

If $r < 0$, the same argument can be run mutatis mutandis: now $D$ must lie above the line $u_2 = 1$, but still be tangent to it, and we can find points of the form $\phi_\varepsilon = (\varepsilon, 1 + f(\varepsilon))$ with $0 \le f(\varepsilon) = O(\varepsilon^2)$ in $D$. With $z_\varepsilon = (\varepsilon^{1/2}, r)$ as before, the analogous inner product is $\langle z_\varepsilon, \phi_\varepsilon \rangle - r = \varepsilon^{3/2} + r f(\varepsilon)$, which is again positive for small enough $\varepsilon$.

Finally, we have the case $r = 0$, wherein $x_t$ lies at the origin. But in this case any point in $D$ of non-zero $u_1$ coordinate serves as a contradiction (since either $(p,0)$ or $(q,0)$ will yield a positive inner product).

Together, the above paragraphs imply that $x_t$ cannot lie in the interior of an edge of $\mathcal{P}$. But this argument generalises to the interior of any non-trivial face. Indeed, since $\tilde{\theta}_t$ must be orthogonal to the affine subspace formed by this face, we can argue that there must be a point in the interior of a 1-D face (that forms a boundary of the larger face) that also attains the optimal value of $\langle \tilde{\theta}, x \rangle$, and then run the above argument for this point. It follows that $x_t$ cannot lie in the interior of any non-trivial face of $\mathcal{P}$. ∎
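The conclusion of Lemma C.1 can be illustrated numerically in two dimensions. The sketch below does not follow the perturbation argument of the proof. Instead it uses the closed form $\max_{\tilde{\theta}} \langle \tilde{\theta}, x \rangle = \langle \hat{\theta}, x \rangle + \sqrt{\omega}\,\|x\|_{V^{-1}}$ of the optimistic index over an ellipsoid, which is a convex function of $x$, so its maximum over a polytope is attained at a vertex. The matrix, centre, radius, and quadrilateral are hypothetical values chosen only for illustration.

```python
import math

# Hypothetical 2x2 design matrix V = [[a, b], [b, c]], centre and radius.
a, b, c = 5.0, 1.5, 2.0
det = a * c - b * b
theta_hat = (0.3, -0.8)
radius = 0.7  # plays the role of sqrt(omega_t(delta)); arbitrary value

def optimistic_index(x):
    """Max of <theta, x> over the ellipsoid: <theta_hat, x> + radius * ||x||_{V^{-1}}."""
    quad = (c * x[0] ** 2 - 2.0 * b * x[0] * x[1] + a * x[1] ** 2) / det
    return theta_hat[0] * x[0] + theta_hat[1] * x[1] + radius * math.sqrt(quad)

# Vertices of a hypothetical convex quadrilateral, listed counter-clockwise.
vertices = [(2.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -2.0)]

best_vertex = max(vertices, key=optimistic_index)
best_vertex_value = optimistic_index(best_vertex)

# Scan the relative interior of every edge on a fine grid.
best_edge_value = -math.inf
for k, v in enumerate(vertices):
    w = vertices[(k + 1) % len(vertices)]
    for j in range(1, 200):
        s = j / 200.0
        point = ((1 - s) * v[0] + s * w[0], (1 - s) * v[1] + s * w[1])
        best_edge_value = max(best_edge_value, optimistic_index(point))

# The best vertex strictly beats every sampled edge-interior point.
print(best_vertex, best_vertex_value > best_edge_value)
```

Only edge interiors are scanned because any point of the polygon is a convex combination of boundary points, so convexity of the index rules out a strictly better point inside.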

The above argument is not restricted to confidence ellipsoids of the form of §3.1, but extends to any $\mathcal{C}_t$ with a smooth and convex boundary. Indeed, this further extends to convex $\mathcal{C}_t$ with continuous boundaries, barring the case where $\tilde{\theta}_t$ is itself the extreme point of a polytope (with large 'curvature' at $\tilde{\theta}_t$). In such a case the property that $f(\varepsilon) = O(\varepsilon^2)$ does not hold, and a more global argument may be needed. One attack may pass through the use of continuous noise, in which case the confidence sets would almost surely not produce any extreme points that are orthogonal to the faces of a polytope (since such directions lie in a union of a finite number of dimension $d-1$ affine subspaces, which in turn is Lebesgue null), and so we may almost surely avoid this disadvantageous case.

Let us also note the following interesting observation that can also be inferred using Lemma C.1, and further characterises the behaviour of doubly-optimistic play.

Proposition C.2.

Suppose that all confidence sets are valid. Then there exists at least one BIS $I$ that $x_t$ noisily activates, and such that $A(I) x_t \ge \alpha(I)$.

In other words, for at least one BIS, the action $x_t$ not only noisily activates it, but also either activates or violates each of the true constraints of this BIS. Notice that if the BIS shown to exist above has at least one unknown constraint, then this basically means that doss must violate safety (since meeting the unknown constraint with exact equality would be rare).

Proof of Proposition C.2.

Fix $x_t$. We call $\tilde{A} \in \boldsymbol{\mathcal{C}}_t$ a witness for $x_t$ if $\tilde{A} x_t \le \alpha$, i.e., if $\tilde{A}$ witnesses the presence of $x_t$ in $\widetilde{\mathcal{S}}_t$. Since $x_t$ is the optimistic optimum over the entirety of $\widetilde{\mathcal{S}}_t$, it follows that for every witness $\tilde{A}$ of $x_t$, it holds that $x_t \in \operatorname*{arg\,max}_{x} \left\{ \max_{\tilde{\theta} \in \mathcal{C}_t^{\theta}} \langle \tilde{\theta}, x \rangle : \tilde{A} x \le \alpha \right\}$.

Now, let $I_0$ be the set of all constraints that $x_t$ noisily activates, and let $I_{\ge} := \{i \in [1:m] : \langle a^i, x_t \rangle \ge \alpha^i\}$. We claim that $|I_{\ge}| \ge d$, which suffices to show the claim.

For the sake of contradiction, assume that $|I_{\ge}| \le d-1$. For each $i \in I_0 \setminus I_{\ge}$, we have $\langle a^i, x_t \rangle < \alpha^i$. Let us form the matrix $\tilde{A}_{<}$ by taking each row $i$ of $\tilde{A}$ for which $i \in I_0 \setminus I_{\ge}$, and replacing the $\tilde{a}^i$ in that row by $\tilde{a}_{<}^i = a^i$. This matrix remains a witness, since the resulting $\tilde{A}_{<}$ lies in $\boldsymbol{\mathcal{C}}_t$ (as we have replaced rows by the rows of $A$, each of which lies in the corresponding confidence set for individual rows), and by definition for each replaced row, $\langle \tilde{a}_{<}^i, x_t \rangle < \alpha^i$, since each such $i$ lies in $I_0 \setminus I_{\ge}$.

Then $x_t$ lies in the interior of the polytope $\mathcal{P}_{<} := \{x : \tilde{A}_{<} x \le \alpha\}$, since by construction it activates at most $|I_{\ge}| \le d-1$ constraints of this matrix. But since $\tilde{A}_{<} \in \boldsymbol{\mathcal{C}}_t$, it holds that $\mathcal{P}_{<} \subset \widetilde{\mathcal{S}}_t$, and thus the algorithm plays in the interior of a polytope contained in the permissible set, contradicting Lemma C.1. Therefore, our hypothesis is untenable, and $|I_{\ge}| \ge d$. ∎

Appendix D Controlling the Play of Suboptimal BISs

We now show the noise scale lower bound, and the subsequent control on the play of suboptimal BISs as discussed in §6.

D.1 Localising Actions when a BIS is Activated

We show Lemma 6.6 as a simple consequence of consistency and optimism.

Proof of Lemma 6.6.

Suppose that the confidence sets are consistent, and that $x_t$ noisily activates the BIS $I$. Since $x_t$ is the action of doss, it is also permissible. Together, these two properties imply that there exists some $\tilde{A} \in \boldsymbol{\mathcal{C}}_t$ such that

\begin{align*}
\tilde{A} x_t &\leq \alpha, \\
\tilde{A}(I) x_t &= \alpha(I).
\end{align*}

But, since $\boldsymbol{\mathcal{C}}_t$ is consistent, Lemma 3.2 yields

\[
A x_t - \rho_t \mathbf{1}_U \leq \tilde{A} x_t \leq A x_t + \rho_t \mathbf{1}_U.
\]

The claim follows directly from this, since

\[
\alpha \geq \tilde{A} x_t \geq A x_t - \rho_t \mathbf{1}_U \implies A x_t \leq \alpha + \rho_t \mathbf{1}_U,
\]

and

\[
\alpha(I) = \tilde{A}(I) x_t \leq A(I) x_t + \rho_t \mathbf{1}_U(I) \implies A(I) x_t \geq \alpha(I) - \rho_t \mathbf{1}_U(I).
\]

Further, due to the optimistic selection of $x_t$, it is a maximiser amongst the permissible set of $\max_{\tilde{\theta} \in \mathcal{C}_t^{\theta}} \langle \tilde{\theta}, x \rangle$. But under consistency, $\theta \in \mathcal{C}_t^{\theta}$, and $x^* \in \widetilde{\mathcal{S}}_t$. Thus, it follows that if $\tilde{\theta}$ is the optimal choice in the above program, then

\[
\langle \tilde{\theta}, x_t \rangle \geq \langle \theta, x^* \rangle.
\]

But, again using consistency and Lemma 3.2, it holds that $\langle \tilde{\theta}, x_t \rangle \leq \langle \theta, x_t \rangle + \rho_t$, from which the claim is forthcoming. ∎
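Spelled out, the final step simply chains the two inequalities displayed in the proof. The following restates that argument and assumes nothing beyond consistency of the confidence sets:

\[
\langle \theta, x^* \rangle \;\leq\; \langle \tilde{\theta}, x_t \rangle \;\leq\; \langle \theta, x_t \rangle + \rho_t
\quad \Longrightarrow \quad
\langle \theta, x_t \rangle \;\geq\; \langle \theta, x^* \rangle - \rho_t .
\]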

D.2 Proof of Noise Scale Lower Bound and the Positivity of the Gaps of Suboptimal BISs

The argument underlying the proof of the noise-scale lower bound is essentially encapsulated in §6.2.1, but refined through the use of the LP $P(\zeta; I)$. The bulk of the following proof goes into showing that the gap we define is meaningful, i.e., that if $I$ is a suboptimal BIS, then $\max(\zeta_*(I), \eta_*(I)) > 0$. This essentially boils down to showing that $\mathfrak{s}(I)$ is finite for feasible BISs.

Proof of Lemma 6.9.

We will first show that under consistency of the confidence sets, playing a suboptimal BIS $I$ implies that $\rho_t(x_t; \delta) \geq \max(\zeta_*(I), \eta_*(I))$. Observe that under the assumption of consistency,

\[
\langle \theta, x_t \rangle \leq P(\rho_t; I),
\]

since $x_t \in \mathcal{T}(\rho_t; I)$ by Lemma 6.6. Further, by the final line of Lemma 6.6, $\langle \theta, x_t \rangle \geq \langle \theta, x^* \rangle - \rho_t$.

Since $x_t \in \mathcal{T}(\rho_t; I)$, this set is nonempty, and therefore by definition $\rho_t \geq \zeta_*(I)$. Note that if $\zeta_*(I) = \infty$, we can conclude already, since this means that $\rho_t(x_t; \delta) \geq \infty \geq \max(\zeta_*(I), \eta_*(I))$. If $\zeta_*(I) < \infty$, then by the definition of the spread $\mathfrak{s}(I)$, and the efficacy separation $\gamma(I)$, we have

\[
P(\rho_t; I) \leq P(\zeta_*(I); I) + \mathfrak{s}(I)(\rho_t - \zeta_*(I)) = \langle \theta, x^* \rangle - \gamma(I) + \mathfrak{s}(I)(\rho_t - \zeta_*(I)).
\]

But then we conclude that

\[
-\rho_t \leq -\gamma(I) + \mathfrak{s}(I)(\rho_t - \zeta_*(I)) \iff \rho_t(\mathfrak{s}(I) + 1) \geq \gamma(I) + \mathfrak{s}(I)\zeta_*(I) \iff \rho_t \geq \eta_*(I).
\]

Thus, the claim follows.
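To make the bound concrete, the following is a minimal numerical sketch on a hypothetical one-dimensional instance (constraints $x \leq 1$ and $-x \leq 0$, both treated as unknown, with $\theta = 1$), where the suboptimal BIS activates $-x = 0$. The instance, the helper names, and the reading of the last equivalence as $\eta_*(I) = (\gamma(I) + \mathfrak{s}(I)\zeta_*(I))/(1 + \mathfrak{s}(I))$, with $\mathfrak{s}(I)$ taken as the largest slope of $P(\cdot; I)$ above $\zeta_*(I)$, are assumptions made for illustration rather than definitions restated from §6.

```python
def relaxed_value(theta, rows, zeta):
    """Max of theta*x over {x : a*x <= b + zeta for each (a, b)}, in one dimension.

    Assumes every a is nonzero. Returns None if the relaxed set is empty.
    """
    lo, hi = float("-inf"), float("inf")
    for a, b in rows:
        bound = (b + zeta) / a
        if a > 0:
            hi = min(hi, bound)   # a*x <= b + zeta gives an upper bound on x
        else:
            lo = max(lo, bound)   # dividing by a < 0 flips it into a lower bound
    if lo > hi:
        return None
    return theta * (hi if theta > 0 else lo)


# Hypothetical instance: x <= 1 and -x <= 0, with theta = 1, so x* = 1.
THETA = 1.0
BASE_ROWS = [(1.0, 1.0), (-1.0, 0.0)]
# Noisily activating the BIS {-x = 0} adds the reversed row x <= 0 + zeta.
BIS_ROWS = BASE_ROWS + [(1.0, 0.0)]


def bis_value(zeta):
    """P(zeta; I) for the toy BIS."""
    return relaxed_value(THETA, BIS_ROWS, zeta)


opt = relaxed_value(THETA, BASE_ROWS, 0.0)   # <theta, x*>
zeta_star = 0.0                              # smallest zeta with a nonempty relaxed set
gamma = opt - bis_value(zeta_star)           # efficacy separation
spread = max((bis_value(z) - bis_value(zeta_star)) / (z - zeta_star)
             for z in (0.1, 0.5, 1.0, 2.0))  # slope of P above zeta_star
eta_star = (gamma + spread * zeta_star) / (1.0 + spread)


def smallest_compatible_rho(lo=0.0, hi=2.0, iters=60):
    """Bisect for the least rho with opt - rho <= P(rho; I)."""
    def compatible(rho):
        value = bis_value(rho)
        return value is not None and value >= opt - rho
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if compatible(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

On this toy instance the bisection threshold and `eta_star` coincide, which matches the proof: the BIS can only be played once the noise scale is large enough to reconcile $\langle \theta, x_t \rangle \leq P(\rho_t; I)$ with $\langle \theta, x_t \rangle \geq \langle \theta, x^* \rangle - \rho_t$.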

We now proceed to argue that for any suboptimal BIS $I$, at least one of $\zeta_*(I)$ and $\eta_*(I)$ is positive. Fix the BIS $I$. Note that if $\zeta_*(I) = \infty$, then there is nothing to show. So, suppose $\zeta_*(I) < \infty$. By expanding out the definition of $\mathcal{T}(\zeta; I)$, the program $P$ is

\begin{align*}
P(\zeta; I) = \max_{x} \quad & \langle \theta, x \rangle \\
\text{s.t.} \quad & Ax \leq \alpha + \zeta \mathbf{1}_U, \\
& -A(I)x \leq -\alpha(I) + \zeta \mathbf{1}_U(I).
\end{align*}

We recall that this is a linear program, which is of course evident in the above. Since $\zeta_*(I) < \infty$, the above program is feasible for $\zeta \geq \zeta_*(I)$. Further, since $\mathcal{X} \supset \mathcal{T}(\zeta; I)$ is a bounded polytope, the program is finite. Thus strong duality applies to the above program.

Let us introduce dual variables $(\lambda, \mu)$ respectively for the two blocks of constraints. By standard techniques, the dual program is

\begin{align*}
D(\zeta; I) = \min_{\lambda, \mu} \quad & \langle \lambda, \alpha + \zeta \mathbf{1}_U \rangle + \langle \mu, -\alpha(I) + \zeta \mathbf{1}_U(I) \rangle \\
\text{s.t.} \quad & A^\top \lambda - A(I)^\top \mu = \theta, \\
& \lambda \geq 0, \ \mu \geq 0.
\end{align*}

For succinctness, let us write

\begin{align*}
f(\lambda, \mu) &:= \langle \lambda, \mathbf{1}_U \rangle + \langle \mu, \mathbf{1}_U(I) \rangle, \\
g(\lambda, \mu) &:= \langle \lambda, \alpha + \zeta_*(I) \mathbf{1}_U \rangle + \langle \mu, -\alpha(I) + \zeta_*(I) \mathbf{1}_U(I) \rangle, \\
h(\lambda, \mu) &:= A^\top \lambda - A(I)^\top \mu - \theta.
\end{align*}

Further, let $\boldsymbol{\lambda} = (\lambda, \mu)$. We can succinctly write the dual as

\[
D(\zeta; I) = \min_{\boldsymbol{\lambda}} \left\{ (\zeta - \zeta_*(I)) f(\boldsymbol{\lambda}) + g(\boldsymbol{\lambda}) : h(\boldsymbol{\lambda}) = 0, \ \lambda \geq 0, \ \mu \geq 0 \right\}.
\]

Note that since the primal is bounded and feasible for $\zeta \geq \zeta_*(I)$, so is the dual, and by strong duality $D(\zeta_*(I); I) = P(\zeta_*(I); I)$. But

\[
D(\zeta_*(I); I) = \min_{\boldsymbol{\lambda}} \left\{ g(\boldsymbol{\lambda}) : h(\boldsymbol{\lambda}) = 0, \ \lambda \geq 0, \ \mu \geq 0 \right\}.
\]

It follows that the set

\[
\mathcal{F} := \{ \boldsymbol{\lambda} : g(\boldsymbol{\lambda}) \leq P(\zeta_*(I); I), \ h(\boldsymbol{\lambda}) = 0, \ \lambda \geq 0, \ \mu \geq 0 \}
\]

is nonempty. Observe that $\zeta$ does not appear anywhere in the definition of $\mathcal{F}$.

Let us define the two programs

\begin{align*}
D'(\zeta; I) &:= \min_{\boldsymbol{\lambda}} \left\{ (\zeta - \zeta_*(I)) f(\boldsymbol{\lambda}) + g(\boldsymbol{\lambda}) : \boldsymbol{\lambda} \in \mathcal{F} \right\}, \\
E(I) &:= \min_{\boldsymbol{\lambda}} \left\{ f(\boldsymbol{\lambda}) : \boldsymbol{\lambda} \in \mathcal{F} \right\}.
\end{align*}

Note that both of the above programs are feasible. As a feasible minimisation program, we also have that $E(I) < \infty$. Further, since introducing extra constraints cannot decrease the value of a minimisation program, we note that $D(\zeta; I) \leq D'(\zeta; I)$. But observe that since the constraints of $D'(\zeta; I)$ include the requirement that $g(\boldsymbol{\lambda}) \leq P(\zeta_*(I); I)$, we have for every $\zeta \geq \zeta_*(I)$ that

\begin{align*}
D'(\zeta; I) &\leq P(\zeta_*(I); I) + \min\{ (\zeta - \zeta_*(I)) f(\boldsymbol{\lambda}) : \boldsymbol{\lambda} \in \mathcal{F} \} \\
&= P(\zeta_*(I); I) + (\zeta - \zeta_*(I)) \cdot \min\{ f(\boldsymbol{\lambda}) : \boldsymbol{\lambda} \in \mathcal{F} \} \\
&= P(\zeta_*(I); I) + (\zeta - \zeta_*(I)) E(I).
\end{align*}

But then, by strong duality,

\[
P(\zeta; I) = D(\zeta; I) \leq P(\zeta_*(I); I) + (\zeta - \zeta_*(I)) E(I),
\]

and we conclude that $\mathfrak{s}(I) \leq \max(0, E(I)) < \infty$.

Now, since $\mathfrak{s}(I)$ is finite, in order to show that $\max(\zeta_*(I), \eta_*(I)) > 0$, it suffices to argue that for any suboptimal BIS, $\max(\zeta_*(I), \gamma(I)) > 0$. But observe that if $\zeta_*(I) = 0$, then $\lim_{\zeta \searrow 0} P(\zeta; I) > -\infty$, and due to the right-continuity of $P$, this implies that $P(0; I) > -\infty \implies \mathcal{X}^I \neq \emptyset$; in other words, $I$ is a feasible BIS. But if a BIS $I$ is both feasible and suboptimal, then for every $x \in \mathcal{X}^I$, it must hold that $\langle \theta, x \rangle < \langle \theta, x^* \rangle$, since otherwise $I$ would be optimal. But, since $\mathcal{X}^I = \mathcal{T}(0; I)$ is a compact set, this means that $P(\zeta_*(I); I) = P(0; I) < \langle \theta, x^* \rangle \iff \gamma(I) > 0$. ∎

D.3 Bounding the Play of Suboptimal BISs

With the above ingredients in place, we show the main result of §6.3.

Proof of Theorem 6.11.

Let us again abbreviate $\rho_t(x_t; \delta)$ as $\rho_t$. By Lemma 6.9, if a suboptimal BIS $I$ is played, then $\rho_t \geq \max(\eta_*(I), \zeta_*(I))$. But then any time a suboptimal BIS is played, $\rho_t \geq \min\{\max(\eta_*(I), \zeta_*(I)) : I \text{ is a suboptimal BIS}\}$, i.e., $\rho_t \geq \Gamma$.

Now observe that

\begin{align*}
\sum_{t=1}^{T} \mathds{1}\{\exists \text{ suboptimal BIS } I : x_t \in \widetilde{\mathcal{X}}_t^{I}\} &\leq \sum_{t=1}^{T} \mathds{1}\{\rho_t \geq \Gamma\} \\
&\leq \sum_{t=1}^{T} \frac{\rho_t^2}{\Gamma^2} \mathds{1}\{\rho_t \geq \Gamma\} \\
&\leq \Gamma^{-2} \sum_{t \leq T} \rho_t^2,
\end{align*}

where the second inequality uses that if $\rho_t \geq \Gamma$, then $\rho_t / \Gamma \geq 1$. Applying Lemma B.2 immediately bounds the above as $O\left(\frac{d^2 \log^2 T + d \log(T) \log(U/\delta)}{\Gamma^2}\right)$. ∎
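The counting step above is a Markov-type argument and can be checked numerically. The sketch below is illustrative only: the decay $\rho_t = 1/\sqrt{t}$ and the value of $\Gamma$ are hypothetical choices, not the noise scale or gap of the paper, and the function names are invented for this example.

```python
import math


def count_large_rounds(rhos, gamma):
    """Number of rounds whose noise scale is at least gamma."""
    return sum(1 for r in rhos if r >= gamma)


def squared_sum_bound(rhos, gamma):
    """The upper bound Gamma^{-2} * sum_t rho_t^2 from the display above."""
    return sum(r * r for r in rhos) / gamma ** 2


# Hypothetical noise scales decaying like 1/sqrt(t) over T = 1000 rounds.
T = 1000
GAMMA = 0.25
rhos = [1.0 / math.sqrt(t) for t in range(1, T + 1)]

n_large = count_large_rounds(rhos, GAMMA)
bound = squared_sum_bound(rhos, GAMMA)
```

For any nonnegative sequence and any positive threshold, `n_large` never exceeds `bound`, since each counted round contributes at least one to the normalised sum of squares.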

Appendix E Proofs of Bounds on Efficacy Regret and Safety Violations

We proceed to discuss the proofs of the results of §7.

E.1 The Efficacy of the Actions of doss when Activating Optimal BISs

Our first order of business is to argue that playing only optimal BISs leads to actions $x_t$ that are ‘over-efficient’, i.e., satisfy $\langle \theta, x_t \rangle \geq \langle \theta, x^* \rangle$. The following basic result is useful in our argument.

Lemma E.1.

For a BIS $I$, define $K_I = I \cap [U+1 : m]$ to be the indices of the known constraints in $I$. Under the genericity of noise assumption, for any BIS $I$ such that $A(K_I)$ is full row rank, for any $t \geq d$, it holds almost surely that $\hat{A}_t(I)$ is full rank.

Proof of Lemma E.1.

Notice that since for any $i$, the noise in the feedback $S_t^i$ is generic, it does not concentrate in any low-dimensional subspace of $\mathbb{R}^d$. This in turn means that the probability that any $\hat{a}^i_t$ lies in a low-dimensional subspace of $\mathbb{R}^d$ is exactly zero. The claim follows immediately, since $|I \setminus K_I| \leq d$, so that each $\hat{a}^i_t$ with probability one does not lie in the span of $\{\hat{a}^j\}_{j \in I \setminus \{i\}}$, and since by assumption $A(K_I)$ is full rank. ∎

With this in hand, we argue Lemma 7.2 by exploiting the weak-nondegeneracy condition of Assumption 7.1.

Proof of Lemma 7.2.

We need to show that if all of the BISs xtsubscript𝑥𝑡x_{t}italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT noisily activates are optimal, then θ,xtθ,x𝜃subscript𝑥𝑡𝜃superscript𝑥\langle\theta,x_{t}\rangle\geq\langle\theta,x^{*}\rangle⟨ italic_θ , italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ⟩ ≥ ⟨ italic_θ , italic_x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ⟩, which comprises the bulk of this proof. To this end, let us fix one such BIS, I𝐼Iitalic_I.

By Assumption 7.1, we know that {x}=𝒳Isuperscript𝑥superscript𝒳𝐼\{x^{*}\}=\mathcal{X}^{I}{ italic_x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT } = caligraphic_X start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT, and that I𝐼Iitalic_I is full-rank. Notice that as a result, we may write

θ,x=maxθ,x:A(I)x=α(I).:𝜃superscript𝑥𝜃𝑥𝐴𝐼𝑥𝛼𝐼\langle\theta,x^{*}\rangle=\max\langle\theta,x\rangle:A(I)x=\alpha(I).⟨ italic_θ , italic_x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ⟩ = roman_max ⟨ italic_θ , italic_x ⟩ : italic_A ( italic_I ) italic_x = italic_α ( italic_I ) .

Indeed, due to the fact that I𝐼Iitalic_I is full rank, the latter equality constraints already enforce that xsuperscript𝑥x^{*}italic_x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is the sole feasible point. Further, by strong duality, there exists a choice of vectors μ𝜇\muitalic_μ such that

μA(I)=θ.superscript𝜇top𝐴𝐼superscript𝜃top\mu^{\top}A(I)=\theta^{\top}.italic_μ start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_A ( italic_I ) = italic_θ start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT .

Due to the optimistic selection rule, and the fact that xtsubscript𝑥𝑡x_{t}italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT noisily saturates I𝐼Iitalic_I, it must hold that xtsubscript𝑥𝑡x_{t}italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is a solution to

maxθ~𝒞tθ,A~𝓒tmaxxθ~,x:A~(I)x=α(I),A~xα,:subscriptformulae-sequence~𝜃superscriptsubscript𝒞𝑡𝜃~𝐴subscript𝓒𝑡subscript𝑥~𝜃𝑥formulae-sequence~𝐴𝐼𝑥𝛼𝐼~𝐴𝑥𝛼\max_{\tilde{\theta}\in\mathcal{C}_{t}^{\theta},\tilde{A}\in\boldsymbol{% \mathcal{{C}}}_{t}}\max_{x}\langle\tilde{\theta},x\rangle:\tilde{A}(I)x=\alpha% (I),\tilde{A}x\leq\alpha,roman_max start_POSTSUBSCRIPT over~ start_ARG italic_θ end_ARG ∈ caligraphic_C start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_θ end_POSTSUPERSCRIPT , over~ start_ARG italic_A end_ARG ∈ bold_caligraphic_C start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_max start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ⟨ over~ start_ARG italic_θ end_ARG , italic_x ⟩ : over~ start_ARG italic_A end_ARG ( italic_I ) italic_x = italic_α ( italic_I ) , over~ start_ARG italic_A end_ARG italic_x ≤ italic_α ,

where the maximisation over $\tilde{A}$ is equivalent to optimistic selection over $\widetilde{\mathcal{S}}_t = \{x : \exists \tilde{A} : \tilde{A}x \le \alpha\}$, and the equality constraint arises since $x_t$ noisily activates $I$. Now observe that in the optimisation above, we may restrict attention to $\tilde{A}$ such that $\tilde{A}(I)$ is full rank. Indeed, if the optimal choice were rank-deficient, then since the feasible set remains a polytope, there must exist some other constraints amongst the $\tilde{A}$ besides those in $I$ that are activated by $x_t$ (since otherwise we would be playing in the interior of a polytope, and thus violating Lemma C.1). By dropping some linearly dependent rows, this would yield a different index set $I'$ that $x_t$ activates, and which is not rank-deficient. By the hypothesis, this index set must also be optimal, and we can run the argument for $I'$ instead. But then note that $x_t$ is exactly characterised by the equality conditions imposed by noisily activating the BIS $I$, which means that $x_t$ is the optimiser of

$$\max_{\substack{\tilde{\theta} \in \mathcal{C}_t^{\theta},\, \tilde{A} \in \boldsymbol{\mathcal{C}}_t, \\ \tilde{M}(I, \tilde{A}) \textrm{ is full-rank}}} \max_{x} \langle \tilde{\theta}, x \rangle : \tilde{A}(I)x = \alpha(I).$$

Now, let us write $\tilde{A} = A + \delta A$, $\tilde{\theta} = \theta + \delta\theta$, $x = x^* + \delta x$. Further denote the optima as $\delta\theta_t, \delta A_t, \delta x_t$. With this notation, our goal is to show that $\langle \theta, \delta x_t \rangle \ge 0$. To this end, observe that since the program above has the constraint $\tilde{A}(I)x = \alpha(I) = A(I)x^*$, we find that

$$\tilde{A}(I)x = A(I)x^* + \delta A(I)x + A(I)\delta x = \alpha(I) \iff A(I)\delta x = -\delta A(I)x,$$

which implies that

$$\begin{aligned}
\langle \theta, \delta x \rangle &= \langle A^\top \mu, \delta x \rangle = \langle \mu, A\delta x \rangle = -\langle \mu, \delta A x \rangle \\
\iff \langle \theta, \delta x \rangle &= \sum_{i \in I} -\mu^i \langle \delta A^i, x \rangle
\end{aligned} \tag{6}$$
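Identity (6) can be sanity-checked numerically. The following is a minimal sketch in $d = 2$ under illustrative assumptions: the matrices, vectors and helper names (`solve2`, `dot`, `matvec`, `transpose`) are hypothetical and not taken from the paper. It solves for $x^*$, $\mu$ and the perturbed solution $x$, then compares $\langle \theta, \delta x \rangle$ with $-\langle \mu, \delta A(I) x \rangle$.

```python
# Numerical sanity check of identity (6) in d = 2, standard library only.
# All numbers below are illustrative assumptions, not values from the paper.

def solve2(M, b):
    """Solve the 2x2 linear system M z = b by Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [
        (b[0] * M[1][1] - M[0][1] * b[1]) / det,
        (M[0][0] * b[1] - M[1][0] * b[0]) / det,
    ]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

A = [[2.0, 1.0], [1.0, 3.0]]        # true active constraints A(I), full rank
alpha = [1.0, 2.0]                  # right-hand side alpha(I)
theta = [1.0, 1.0]                  # true reward vector
dA = [[0.1, -0.05], [0.02, 0.03]]   # perturbation delta A(I)

x_star = solve2(A, alpha)           # A(I) x* = alpha(I)
mu = solve2(transpose(A), theta)    # A(I)^T mu = theta, i.e. mu^T A(I) = theta^T
A_tilde = [[A[i][j] + dA[i][j] for j in range(2)] for i in range(2)]
x = solve2(A_tilde, alpha)          # (A(I) + delta A(I)) x = alpha(I)
dx = [x[k] - x_star[k] for k in range(2)]

lhs = dot(theta, dx)                # <theta, delta x>
rhs = -dot(mu, matvec(dA, x))       # -<mu, delta A(I) x>
print(abs(lhs - rhs) < 1e-9)
```

The check relies only on $A(I)x^* = \alpha(I)$, $\tilde{A}(I)x = \alpha(I)$ and $\mu^\top A(I) = \theta^\top$, so any full-rank choice of `A` and any sufficiently small `dA` should behave the same way.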

Thus, we can rewrite the program as

$$\max_{x} \max_{\delta\theta, \delta A} \langle \theta, x^* \rangle + \langle \delta\theta, x \rangle - \langle \mu, \delta A(I)x \rangle : \tilde{A}(I)x = \alpha(I).$$

Now, recall that the confidence sets are constructed around the RLS estimates $\hat{a}^i_t$ and $\hat{\theta}_t$, i.e.,

$$\mathcal{C}_t^{\theta} = \{\tilde{\theta} : \|\tilde{\theta} - \hat{\theta}_t\|_{V_{t-1}} \le \omega_t\}, \qquad \mathcal{C}_t^{i} = \{\tilde{a} : \|\tilde{a} - \hat{a}^i_t\|_{V_{t-1}} \le \omega_t \mathbf{1}_U^i\}.$$

To clearly express the choice of $\delta\theta, \delta A$, we define

$$\begin{aligned}
\Delta\theta_t &= \hat{\theta}_t - \theta, & \Delta a^i_t &= \hat{a}^i_t - a^i, & \Delta A &= \hat{A}_t - A, \\
\partial\theta &= \tilde{\theta} - \hat{\theta}_t, & \partial a^i &= \tilde{a}^i - \hat{a}^i_t, & \partial A &= \tilde{A} - \hat{A}_t.
\end{aligned}$$

Observe then that

$$\delta\theta = \Delta\theta_t + \partial\theta; \qquad \delta a^i = \Delta a^i_t + \partial a^i.$$

Further, the decision variables of the program are only $\partial\theta$ and the $\partial a^i$s, which lie in the sets $\|\partial\theta\|_{V_{t-1}} \le \omega_t$ and $\|\partial a^i\|_{V_{t-1}} \le \omega_t \mathbf{1}_U^i$. Let us denote $U_I = I \cap [1:U]$ and $K_I = I \cap [U+1:m]$, and observe that $\Delta a^i = \partial a^i = 0$ for $i \in K_I$. Incorporating this structure, we can write the program as

$$\begin{aligned}
\langle \theta, x^* \rangle + \max_{x} \max_{\partial\theta, \partial A} \quad & \langle \Delta\theta_t, x \rangle - \sum_{i \in U_I} \mu^i \langle \Delta a^i_t, x \rangle + \langle \partial\theta, x \rangle - \sum_{i \in U_I} \mu^i \langle \partial a^i, x \rangle \\
\text{s.t.} \quad & \langle a^i + \Delta a^i_t, x \rangle + \langle \partial a^i, x \rangle = \alpha^i \quad \forall i \in I, \\
& \|\partial\theta\|_{V_{t-1}} \le \omega_t, \\
& \|\partial a^i\|_{V_{t-1}} \le \omega_t \mathbf{1}_U^i \quad \forall i \in I.
\end{aligned}$$

But now observe that the optimal choice of $\partial\theta$ in the above is exactly $\frac{\omega_t}{\|x\|_{V_{t-1}^{-1}}} V_{t-1}^{-1}x$. Indeed, recall that $\|u\|_{V_{t-1}} = \sqrt{u^\top V_{t-1} u} = \|V_{t-1}^{1/2}u\|$, and similarly $\|u\|_{V_{t-1}^{-1}} = \|V_{t-1}^{-1/2}u\|$. By the Cauchy-Schwarz inequality, $\langle \partial\theta, x \rangle = \langle V_{t-1}^{1/2}\partial\theta, V_{t-1}^{-1/2}x \rangle \le \|\partial\theta\|_{V_{t-1}} \|x\|_{V_{t-1}^{-1}}$, and this is extremised when $V_{t-1}^{1/2}\partial\theta \propto V_{t-1}^{-1/2}x \iff \partial\theta \propto V_{t-1}^{-1}x$. Further, if for a scalar $\varphi$, $\partial\theta = \varphi \cdot V_{t-1}^{-1}x$, then

$$\partial\theta^\top V_{t-1} \partial\theta = \varphi^2 x^\top V_{t-1}^{-1} V_{t-1} V_{t-1}^{-1} x = \varphi^2 x^\top V_{t-1}^{-1} x,$$

or equivalently, $\|\varphi \cdot V_{t-1}^{-1}x\|_{V_{t-1}} = |\varphi| \|x\|_{V_{t-1}^{-1}}$, which means that to saturate the constraint $\|\partial\theta\|_{V_{t-1}} \le \omega_t$, we must set $\partial\theta = \pm \frac{\omega_t}{\|x\|_{V_{t-1}^{-1}}} V_{t-1}^{-1}x$, and of these the $+$ solution gives a positive value, and so is optimal.
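The Cauchy-Schwarz argument above can be illustrated numerically. The sketch below is a hedged example, not part of the paper's analysis: the matrix `V`, the vector `x`, the radius `omega` and all helper names are assumed values chosen for illustration. It builds the claimed maximiser, confirms that it sits on the boundary of the confidence ellipsoid and attains $\omega_t \|x\|_{V_{t-1}^{-1}}$, and compares it against a grid of other boundary points.

```python
import math

# Illustrative check (d = 2) that d_theta = omega * V^{-1} x / ||x||_{V^{-1}}
# maximises <d_theta, x> over the ellipsoid ||d_theta||_V <= omega.
# V, x and omega are assumed values for illustration only.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def inv2(M):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]

V = [[2.0, 0.5], [0.5, 1.0]]   # symmetric positive definite
x = [1.0, -2.0]
omega = 0.7

V_inv = inv2(V)
x_dual = math.sqrt(dot(x, matvec(V_inv, x)))            # ||x||_{V^{-1}}
d_theta = [omega / x_dual * c for c in matvec(V_inv, x)]

v_norm = math.sqrt(dot(d_theta, matvec(V, d_theta)))    # ||d_theta||_V
value = dot(d_theta, x)                                 # <d_theta, x>

# Scan other points on the boundary {u : ||u||_V = omega} along a grid of directions.
best_other = -math.inf
for k in range(360):
    d = [math.cos(math.radians(k)), math.sin(math.radians(k))]
    scale = omega / math.sqrt(dot(d, matvec(V, d)))
    best_other = max(best_other, dot([scale * c for c in d], x))

print(v_norm, value, omega * x_dual, best_other)
```

The grid scan is only a spot check of the inequality; the equality case is what the closed form for `d_theta` captures.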

Further notice that the optimal choice of $\partial a^i$ for $i \in U_I$ must similarly be aligned with $V_{t-1}^{-1}x$. Indeed, write $V_{t-1}^{1/2}\partial a^i = \omega_t \sigma^i V_{t-1}^{-1/2}x + \psi^i$, where $\sigma^i$ is a scalar, and $\psi^i$ is a vector such that $\langle \psi^i, V_{t-1}^{-1/2}x \rangle = 0$. Then observe that due to the orthogonality,

$$\|\partial a^i\|_{V_{t-1}}^2 = \langle V_{t-1}^{1/2}\partial a^i, V_{t-1}^{1/2}\partial a^i \rangle = \langle \omega_t \sigma^i V_{t-1}^{-1/2}x + \psi^i, \omega_t \sigma^i V_{t-1}^{-1/2}x + \psi^i \rangle = \omega_t^2 (\sigma^i)^2 \|x\|_{V_{t-1}^{-1}}^2 + \|\psi^i\|^2,$$

and so the constraint on $\partial a^i$ becomes $(\omega_t \sigma^i)^2 \|x\|_{V_{t-1}^{-1}}^2 + \|\psi^i\|^2 \le \omega_t^2$. But $\psi^i$ affects neither the first constraint on $\langle a^i + \Delta a^i_t + \partial a^i, x \rangle$, nor the objective, since

$$\langle \partial a^i, x \rangle = \langle V_{t-1}^{1/2}\partial a^i, V_{t-1}^{-1/2}x \rangle = \langle \omega_t \sigma^i V_{t-1}^{-1/2}x, V_{t-1}^{-1/2}x \rangle + \langle \psi^i, V_{t-1}^{-1/2}x \rangle = \omega_t \sigma^i \|x\|_{V_{t-1}^{-1}}^2.$$
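As a hedged illustration of the decomposition $V_{t-1}^{1/2}\partial a^i = \omega_t \sigma^i V_{t-1}^{-1/2}x + \psi^i$, the sketch below takes a diagonal $V$ (a simplifying assumption, so that matrix square roots are elementwise) and assumed values for `x`, `omega` and `sigma`. It varies the size of the orthogonal component $\psi^i$ and records the inner product $\langle \partial a^i, x \rangle$ together with the squared norm $\|\partial a^i\|_{V_{t-1}}^2$.

```python
import math

# Illustrative check (d = 2, diagonal V assumed for simplicity) that the
# component psi orthogonal to V^{-1/2} x changes ||da||_V but not <da, x>.
# All numbers are assumptions for illustration only.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

v_diag = [4.0, 9.0]            # V = diag(4, 9), so V^{1/2} = diag(2, 3)
x = [1.0, 2.0]
omega, sigma = 0.5, 0.3

w = [xi / math.sqrt(vi) for xi, vi in zip(x, v_diag)]   # V^{-1/2} x
x_dual_sq = dot(w, w)                                    # ||x||_{V^{-1}}^2

def d_a(c):
    """Return (da, psi) with V^{1/2} da = omega*sigma*w + psi, psi = c*(-w[1], w[0])."""
    psi = [-c * w[1], c * w[0]]                          # orthogonal to w
    rhs = [omega * sigma * wi + pi for wi, pi in zip(w, psi)]
    return [ri / math.sqrt(vi) for ri, vi in zip(rhs, v_diag)], psi

results = []
for c in (0.0, 0.1, 1.0):
    da, psi = d_a(c)
    inner = dot(da, x)                                           # <da, x>
    norm_sq = sum(vi * di * di for vi, di in zip(v_diag, da))    # ||da||_V^2
    results.append((inner, norm_sq, dot(psi, psi)))

for inner, norm_sq, psi_sq in results:
    print(inner, norm_sq, psi_sq)
```

In every row the inner product equals `omega * sigma * x_dual_sq`, while the squared norm grows by exactly `psi_sq`. Under these assumptions, a nonzero $\psi^i$ only consumes the norm budget.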

This means that dumping any energy into $\psi^i$ affects neither the equality constraints nor the objective, so we can safely set it to zero in the following (in fact, as we shall see below, it must be zero since $\sigma^i$ must saturate). This allows us to considerably simplify the above program: by introducing the real-valued variables $\sigma^i$ for $i \in I$, and noting that $\partial a^i = 0$ for $i \in K_I$ can be achieved by demanding $(\sigma^i)^2 \|x\|_{V_{t-1}^{-1}}^2 \le 0 = \mathbf{1}_U^i$ for $i \in K_I$, we may rewrite the program above as

$$\begin{aligned}
\langle \theta, x^* \rangle + \max_{x} \max_{\{\sigma^i\}} \quad & \langle \Delta\theta_t, x \rangle - \sum_{i \in U_I} \mu^i \langle \Delta a^i_t, x \rangle + \omega_t \|x\|_{V_{t-1}^{-1}} - \sum_{i \in U_I} \mu^i \sigma^i \omega_t \|x\|_{V_{t-1}^{-1}}^2 \\
\text{s.t.} \quad & \langle a^i + \Delta a^i_t, x \rangle = \alpha^i - \omega_t \sigma^i \|x\|_{V_{t-1}^{-1}}^2 \quad \forall i \in I, \\
& \langle a^i, x \rangle = \alpha^i \quad \forall i \in K_I, \\
& (\sigma^i)^2 \|x\|_{V_{t-1}^{-1}}^2 \le \mathbf{1}_U^i \quad \forall i \in I.
\end{aligned}$$

Finally, observe that the first pair of constraints can be succinctly written in terms of $\hat{A}_{t}(\mathsf{U})$, giving us the following restatement, where $\sigma$ is the vector formed by stacking the $\sigma^{i}$s.

\begin{align*}
\langle\theta,x^{*}\rangle+\max_{x}\max_{\sigma}\quad & \langle\Delta\theta_{t},x\rangle-\sum_{i\in I}\mu^{i}\langle\Delta a^{i}_{t},x\rangle+\omega_{t}\|x\|_{V_{t-1}^{-1}}-\langle\mu,\sigma\rangle\omega_{t}\|x\|_{V_{t-1}^{-1}}^{2}\\
\text{s.t.}\quad & \hat{A}_{t}(I)x=\alpha(I)-\omega_{t}\|x\|^{2}_{V_{t-1}^{-1}}\sigma,\\
& (\sigma^{i})^{2}\|x\|_{V_{t-1}^{-1}}^{2}\leq\mathbf{1}_{U}^{i}\quad\forall i\in I.
\end{align*}

But notice that $A(K_{I})$ is full row rank by assumption, and thus, applying Lemma E.1, with probability one, $\hat{A}_{t}(I)$ is full-rank. But this means that every value of $\sigma$ that meets the final constraint is feasible for the above program, since we can find an appropriate $x$ by inverting $\hat{A}_{t}(I)$. Of course, the optimal choice of $\sigma^{i}$ is then $-\mathbf{1}_{U}^{i}\,\mathrm{sign}(\mu^{i})/\|x\|_{V_{t-1}^{-1}}$, telling us that for each $i\in U_{I}$, the optimal $\partial a^{i}$ at time $t$ is

\[
\partial a^{i}_{t}=-\mathbf{1}_{U}^{i}\,\mathrm{sign}(\mu^{i})\,\omega_{t}V_{t-1}^{-1}x/\|x\|_{V_{t-1}^{-1}}\implies-\mu^{i}\langle\partial a^{i}_{t},x\rangle=\mathbf{1}_{U}^{i}\,\omega_{t}|\mu^{i}|\cdot\|x\|_{V_{t-1}^{-1}}.
\]

Now, finally, we observe that for each $i\in I$ and every $x$, $\omega_{t}|\mu^{i}|\|x\|_{V_{t-1}^{-1}}-\mu^{i}\langle\Delta a^{i}_{t},x\rangle\geq 0$. Indeed, for $i\in K_{I}$ this is trivial since both $\Delta a^{i}_{t}$ and $\partial a^{i}_{t}$ are $0$ for such $i$.
For $i\in U_{I}$, since the confidence sets are consistent, we know that $a^{i}\in\mathcal{C}^{i}_{t}\iff\|\Delta a^{i}_{t}\|_{V_{t-1}}\leq\omega_{t}$. But then

\[
|\mu^{i}\langle\Delta a^{i}_{t},x\rangle|=|\mu^{i}|\,|\langle V_{t-1}^{1/2}\Delta a^{i}_{t},V_{t-1}^{-1/2}x\rangle|\leq|\mu^{i}|\,\|\Delta a^{i}_{t}\|_{V_{t-1}}\|x\|_{V_{t-1}^{-1}}\leq|\mu^{i}|\,\omega_{t}\|x\|_{V_{t-1}^{-1}}.
\]
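As a quick numerical sanity check of the weighted Cauchy-Schwarz step above, the following sketch samples random vectors and compares $|\langle\Delta a,x\rangle|$ with $\|\Delta a\|_{V}\|x\|_{V^{-1}}$. The $2\times 2$ matrix `V`, the radius `omega`, and all helper names are hypothetical choices made only for illustration; they are not taken from the paper.

```python
import math
import random

def mat_vec(M, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(M[r][c] * v[c] for c in range(len(v))) for r in range(len(M))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def inverse_2x2(M):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def weighted_norm(M, v):
    """Return sqrt(v^T M v) for a symmetric positive definite M."""
    return math.sqrt(dot(v, mat_vec(M, v)))

# Hypothetical stand-in for V_{t-1}: any symmetric positive definite matrix works.
V = [[3.0, 1.0], [1.0, 2.0]]
V_inv = inverse_2x2(V)
omega = 0.7  # hypothetical confidence radius omega_t

# Check |<delta_a, x>| <= ||delta_a||_V * ||x||_{V^{-1}} on random draws.
rng = random.Random(0)
worst_slack = float("inf")
for _ in range(1000):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    delta_a = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    lhs = abs(dot(delta_a, x))
    rhs = weighted_norm(V, delta_a) * weighted_norm(V_inv, x)
    worst_slack = min(worst_slack, rhs - lhs)

# The perturbation omega * V^{-1} x / ||x||_{V^{-1}} has V-norm omega
# and attains the bound with equality.
x = [0.4, -0.9]
x_norm = weighted_norm(V_inv, x)
extremal = [omega * c / x_norm for c in mat_vec(V_inv, x)]
extremal_norm = weighted_norm(V, extremal)
extremal_inner = dot(extremal, x)
```

The extremal vector has the same form, up to sign, as the optimal perturbation derived earlier, which is why that choice saturates the confidence radius.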

But now we are in business. Indeed, using (6), we finally have

\begin{align*}
\langle\theta,\delta x_{t}\rangle &=\sum_{i\in I}-\mu^{i}\langle\delta a_{t}^{i},x_{t}\rangle=\sum_{i\in I}\bigl(-\mu^{i}\langle\partial a_{t}^{i},x_{t}\rangle-\mu^{i}\langle\Delta a_{t}^{i},x_{t}\rangle\bigr)\\
&=\sum_{i\in I}\bigl(|\mu^{i}|\,\omega_{t}\|x_{t}\|_{V_{t-1}^{-1}}-\mu^{i}\langle\Delta a^{i}_{t},x_{t}\rangle\bigr)\geq 0,
\end{align*}

and we are done. ∎

The role of the non-degeneracy condition, Assumption 7.1, in the above is fairly weak: all we really need is that $x_{t}$ noisily activates some index set such that the true $\theta$ can be expressed as a linear combination of the true constraint vectors of that index set. In the absence of this, the proof does not quite work as stated, since it may be the case that some constraints that are needed to express $\theta$ are not noisily activated by $x_{t}$ (although such constraints are activated by $x^{*}$). This removes the equality of the various programs we wrote, and would only leave us with a lower bound (in terms of the constraints that are active at $x^{*}$ but not noisily active at $x_{t}$, along with the ones above), and it is unclear if $x_{t}$ must also optimise this lower bound.

Nevertheless, we believe that this requirement is an artefact of our proof strategy: in general, optimistic play, when it leaks out of the safe set, has tremendous freedom to activate any noisy constraints, and the conspiring of a choice of $\delta\theta$ and $\delta A$ that makes the point suboptimal is severely constrained due to the presence of a large number of over-efficient actions in the vicinity of the safe set. Exactly nailing down an argument that cleanly expresses this intuition is an open problem.

E.2 Proof of the Main Theorem

With all the pieces in place, we proceed to argue our main claim.

Proof of Theorem 7.3.

With probability at least $1-\delta$, all the confidence sets are consistent. We assume that this indeed occurs, and argue the claim under this event.

We first split the time horizon into two groups depending on whether $x_{t}$ noisily activates suboptimal BISs or not by defining

\[
\mathfrak{T}_{1}:=\{t\in[d+1:T]:\exists\textrm{ a suboptimal BIS $I$ such that }x_{t}\in\widetilde{\mathcal{X}}_{t}^{I}\}.
\]

Notice that for $t\in[d+1:T]\setminus\mathfrak{T}_{1}$, $x_{t}$ only activates optimal BISs.

Now, by Lemma 6.9, for all $t\in\mathfrak{T}_{1}$, $\rho_{t}(x_{t};\delta)\geq\Gamma$, and further, by Lemma 7.2, for every $t\in[d+1:T]\setminus\mathfrak{T}_{1}$, it holds that $\langle\theta,x^{*}-x_{t}\rangle\leq 0$. Finally, we observe that for all times it must hold that

\[
\langle\theta,x_{t}\rangle\geq\langle\theta,x^{*}\rangle-\rho_{t}(x_{t};\delta).
\]

Indeed, due to consistency, both $\theta$ and $x^{*}$ are feasible choices for the actions of doss. Thus, if some $\tilde{\theta},x_{t}$ are chosen instead, then $\langle\tilde{\theta},x_{t}\rangle\geq\langle\theta,x^{*}\rangle$. But by Lemma 3.2, under consistency, $\langle\theta,x_{t}\rangle\geq\langle\tilde{\theta},x_{t}\rangle-\rho_{t}(x_{t};\delta)$, giving the above claim.

We thus have the efficacy control

\begin{align*}
\mathscr{E}_{T} &=\sum_{t}\langle\theta,x^{*}-x_{t}\rangle_{+}=\sum_{t\leq d}\langle\theta,x^{*}-x_{t}\rangle_{+}+\sum_{t\in\mathfrak{T}_{1}}\langle\theta,x^{*}-x_{t}\rangle_{+}+\sum_{t\not\in\mathfrak{T}_{1}}\langle\theta,x^{*}-x_{t}\rangle_{+}\\
&\leq d+\sum_{t\in\mathfrak{T}_{1}}\rho_{t}(x_{t};\delta)+0\\
&\leq d+\sum_{t}\rho_{t}(x_{t};\delta)\mathds{1}\{\rho_{t}(x_{t};\delta)\geq\Gamma\}\\
&\leq d+\sum_{t}\rho_{t}(x_{t};\delta)\cdot\frac{\rho_{t}(x_{t};\delta)}{\Gamma}\\
&=d+\frac{1}{\Gamma}\sum_{t}\rho_{t}(x_{t};\delta)^{2},
\end{align*}

whence the claimed bound follows upon using Lemma B.2. As in §5, we have used the trick that $\mathds{1}\{u\geq v\}\leq u/v$ for positive $v$.
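The indicator trick can be illustrated numerically. The sketch below uses a hypothetical width sequence $\rho_{t}=1/\sqrt{t}$ and a hypothetical gap value; neither is taken from the paper, and the actual widths are controlled by Lemma B.2. It only shows that the thresholded sum is dominated by the sum of squares divided by the gap.

```python
import math

def thresholded_sum(widths, gap):
    """Sum of rho_t over the rounds where rho_t >= gap."""
    return sum(r for r in widths if r >= gap)

def squared_sum_bound(widths, gap):
    """Upper bound (1/gap) * sum of rho_t^2, from 1{u >= v} <= u/v."""
    return sum(r * r for r in widths) / gap

T = 10_000
gap = 0.2  # hypothetical stand-in for Gamma
widths = [1.0 / math.sqrt(t) for t in range(1, T + 1)]  # hypothetical rho_t

lhs = thresholded_sum(widths, gap)
rhs = squared_sum_bound(widths, gap)
large_width_rounds = sum(1 for r in widths if r >= gap)
```

With these illustrative widths only a small number of early rounds pass the threshold, while the bound on the right grows with the sum of squared widths, which for this particular sequence is a harmonic sum.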

To control the safety behaviour, we observe that due to the property that $x_{t}\in\widetilde{\mathcal{S}}_{t}(\delta)$, there must exist some witness $\tilde{A}_{t}\in\boldsymbol{\mathcal{C}}_{t}(\delta)$ such that $\tilde{A}_{t}x_{t}\leq\alpha$. But, again by Lemma 3.2, under consistency, for every $i$,

\[
\langle\tilde{a}_{t}^{i},x_{t}\rangle\geq\langle a^{i},x_{t}\rangle-\rho_{t}(x_{t};\delta),
\]

which implies that

\[
\max_{i}\langle a^{i},x_{t}\rangle-\alpha^{i}\leq\rho_{t}(x_{t};\delta).
\]

But then

\[
\mathscr{S}_{T}=\sum_{t\leq T}\max_{i}(\langle a^{i},x_{t}\rangle-\alpha^{i})_{+}\leq\sum_{t\leq T}\rho_{t}(x_{t};\delta),
\]

and the claim is immediate from Lemma B.2. Above, we have used the elementary fact that if $u\leq v$ and $v>0$, then $(u)_{+}\leq v$.

We note that the upper bound in Theorem 4.1 is also immediate from the above argument. The control on $\mathscr{S}_{T}$ can be repeated verbatim, while to control $\mathscr{E}_{T}$, we note that we began by showing that $\langle\theta,x^{*}-x_{t}\rangle\leq\rho_{t}(x_{t};\delta)$, so the conclusion of the control on $\mathscr{S}_{T}$ above can be repeated verbatim. ∎

E.3 Proofs of Polylogarithmic Safety Violation Claims from §7

Finally, we show the proof of the subsidiary observation from §7.

E.3.1 Finite Precision in Constraint Levels

The argument relies on the following observation.

Lemma E.2.

Under consistency, for every $\varepsilon>0$ and every $t$, if doss($\delta$) plays an action $x_{t}$ such that $\max_{i}(\langle a^{i},x_{t}\rangle-\alpha^{i})_{+}\geq\varepsilon$, then $\rho_{t}(x_{t};\delta)\geq\varepsilon$.

Proof.

As in the proof of Theorem 7.3, if the algorithm plays $x_{t}$, then

\[
\exists\tilde{A}\in\boldsymbol{\mathcal{C}}_{t}(\delta):\tilde{A}x_{t}\leq\alpha.
\]

But, under consistency, by Lemma 3.2,

\[
\tilde{A}x_{t}\geq Ax_{t}-\rho_{t}(x_{t};\delta)\mathbf{1}_{U},
\]

and so for every $i$,

\[
\langle a^{i},x_{t}\rangle-\alpha^{i}\leq\langle\tilde{a}^{i},x_{t}\rangle+\rho_{t}(x_{t};\delta)-\alpha^{i}\leq\rho_{t}(x_{t};\delta),
\]

and the claim follows by maximising over $i$. ∎
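The chain of inequalities in this proof can be checked on a toy instance. In the sketch below, the true constraint matrix `A`, the optimistic witness `A_tilde`, the levels `alpha`, the action `x`, and the width `rho` are all hypothetical numbers chosen so that the two hypotheses of the argument hold; the code then confirms that the true violation is at most `rho`.

```python
def row_values(M, x):
    """Return the list of inner products <m^i, x> over the rows m^i of M."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def max_violation(M, alpha, x):
    """Return max_i (<m^i, x> - alpha^i)_+ ."""
    return max(max(v - lvl, 0.0) for v, lvl in zip(row_values(M, x), alpha))

# Hypothetical toy instance with two constraints in two dimensions.
A = [[1.0, 0.5], [0.2, 1.0]]         # true constraints (unknown to the learner)
A_tilde = [[0.85, 0.5], [0.2, 1.0]]  # optimistic witness from the confidence set
alpha = [1.0, 1.0]
x = [0.9, 0.4]                       # action played
rho = 0.15                           # hypothetical width rho_t(x_t; delta)

true_vals = row_values(A, x)
optimistic_vals = row_values(A_tilde, x)

# Hypothesis 1: the witness certifies optimistic feasibility, A_tilde x <= alpha.
optimistically_feasible = all(v <= lvl for v, lvl in zip(optimistic_vals, alpha))
# Hypothesis 2: the width controls the gap, A_tilde x >= A x - rho.
width_controls_gap = all(o >= t - rho for o, t in zip(optimistic_vals, true_vals))

violation = max_violation(A, alpha, x)
```

Here the action is infeasible for the true constraints yet feasible for the witness, and the size of the true violation is bounded by the width, as the argument predicts.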

The above is enough to enable the argument, which goes along the lines of the proof of logarithmic bounds on $\mathscr{E}_{T}$ in Theorem 7.3.

Proof of Theorem 7.5.

As always, we begin by assuming consistency of the confidence sets, which occurs with probability at least $1-\delta$. Observe that the proof of efficacy can be repeated verbatim from the previous section under consistency. To control the net violations, first recall that by Lemma E.2, $\exists i:\langle a^{i},x_{t}\rangle-\alpha^{i}>\varepsilon\implies\rho_{t}(x_{t};\delta)>\varepsilon$. It thus follows that

\begin{align*}
\mathscr{S}_{T}^{\varepsilon} &=\sum_{t\leq T}(\langle a^{i},x_{t}\rangle-\alpha^{i})\mathds{1}\{\exists i:\langle a^{i},x_{t}\rangle-\alpha^{i}>\varepsilon\}\\
&\leq\sum_{t\leq T}\rho_{t}(x_{t};\delta)\mathds{1}\{\rho_{t}(x_{t};\delta)>\varepsilon\}\\
&\leq\sum_{t\leq T}\rho_{t}(x_{t};\delta)^{2}/\varepsilon,
\end{align*}

and the claim follows from Lemma B.2. ∎
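The final inequality in the display above uses only that $\rho > \varepsilon$ implies $\rho/\varepsilon > 1$, so that $\rho\,\mathds{1}\{\rho > \varepsilon\} \leq \rho^2/\varepsilon$. The following is a small numerical sanity check of this truncation step. The function names and the sample radii (decaying like $1/\sqrt{t}$) are illustrative assumptions, not quantities taken from the paper.

```python
def truncated_sum(rhos, eps):
    """Sum of rho * 1{rho > eps}: only radii exceeding eps contribute."""
    return sum(r for r in rhos if r > eps)


def squared_bound(rhos, eps):
    """The upper bound sum(rho^2) / eps used in the proof."""
    return sum(r * r for r in rhos) / eps


# Hypothetical confidence radii decaying like 1/sqrt(t).
rhos = [1.0 / (t ** 0.5) for t in range(1, 1001)]
eps = 0.1
print(truncated_sum(rhos, eps) <= squared_bound(rhos, eps))
```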

E.3.2 Finite Precision in Constraints

We argue that due to the finite precision in the constraint levels, there exists a minimal error scale for the problem.

Lemma E.3.

There exists a constant $\pi > 0$ such that if the confidence sets are consistent, and the finite-constraint-precision version of doss($\delta$) picks an $x_t$ that only activates optimal BISs, but $x_t$ is either infeasible or ineffective, then $\rho_t(x_t;\delta) \geq \pi$.

Proof.

Let $I$ be an optimal BIS that $x_t$ noisily activates. This is again full rank by Assumption 7.1, and there exists some $\tilde{A} \in \boldsymbol{\mathcal{C}}_t^{\mathsf{P}}$ such that $\tilde{A}(I)x_t = \alpha(I),\ \tilde{A}(I)x_t \leq \alpha.$ As in the proof of Theorem 7.3, we can restrict attention to $\tilde{A}$ such that $\tilde{A}(I)$ is full-rank, since one such $I, \tilde{A}$ must exist.

Since $\tilde{A}(I)$ is full rank, we immediately know that $x_t = \tilde{A}(I)^{-1}\alpha(I).$ But then, since there are only a finite number of possible choices for $\tilde{A}(I)$ in $\mathsf{P},$ there are only a finite number of candidate $x_t$. Let us define $x(\tilde{A}(I)) = \tilde{A}(I)^{-1}\alpha(I)$, and $\mathcal{X}(I) = \{x(\tilde{A}(I))\}.$ Since $x^*$ is assumed to be the unique optimum, we know that for each $x \in \mathcal{X}(I),$ it must hold that $\pi(x) := \max\left\{\langle\theta, x^* - x\rangle, \max_i(\langle a^i, x\rangle - \alpha^i)_+\right\}$ is strictly positive, which in turn yields that

$$\pi_I := \min_{x \in \mathcal{X}(I)} \pi(x) > 0.$$

Of course, we also conclude then that if $x_t$ noisily activates $I$ but is infeasible or suboptimal, then it must be at least $\pi_I$-infeasible or $\pi_I$-suboptimal, which via Lemma 3.2 and an argument similar to that in the proof of Lemma E.2 yields that $\rho_t(x_t;\delta) \geq \pi_I$.

Of course, since some optimal full rank BIS must be activated, we conclude that if $x_t$ is not the optimum, then

$$\rho_t(x_t;\delta) \geq \pi := \min_{\textrm{optimal BISs } I} \pi_I,$$

and we are done.∎

Let us note that the argument above is quite crude, in that we simply take a minimum over all candidates once we establish the finitude of the set of these candidates. A more refined analysis may recover stronger local behaviour by analysing what types of $\tilde{A}(I)$ remain in $\boldsymbol{\mathcal{C}}_t(\delta)$ once enough information has been accumulated, and use this to develop notions of gaps for finite-constraint-precision scenarios that dominate the quantity we have constructed above. We leave this interesting line of study for future work.

In any case, exploiting the above yields the result.

Proof of Theorem 7.6.

Working along the lines of the proof of Theorem 7.3 yields control of the form $O(\Gamma^{1}d^{2}\log^{2}T)$ on both the efficacy and safety costs accumulated over times $t$ for which a suboptimal BIS was activated. Restricting attention then to optimal BISs, if a suboptimal or infeasible action $x_t$ were picked, then by the above Lemma, $\rho_t(x_t;\delta) \geq \pi$. This lets us repeat the same argument, but now over $t$ for which an optimal BIS was activated, which yields bounds of $O(\pi^{-1}d^{2}\log^{2}(T))$, and the overall cost is bounded by the sum of these two quantities. ∎

E.3.3 Finite Action Setting.

Let us specify the setting in a little more detail: we are supplied with a finite set $\mathcal{A} \subset \mathbb{R}^d$, and in each round the learner chooses one action $x_t \in \mathcal{A}$. The linear reward and constraint structures are kept identical, and $x^*$ is updated to be the best action in $\mathcal{A},$ i.e.,

$$x^* := \operatorname*{arg\,max}\ \langle\theta, x\rangle : Ax \leq \alpha,\ x \in \mathcal{A}.$$

Note that the known constraints are no longer necessary: if they are given, then we may filter $\mathcal{A}$ before play starts. The gap $\Delta := \min_{x \in \mathcal{A}, x \neq x^*} \max(\langle\theta, x^* - x\rangle, \max_i(\langle a^i, x\rangle - \alpha^i)_+)$ is non-zero simply because each suboptimal arm in $\mathcal{A}$ must be either infeasible, or ineffective, and the minimisation is over a finite set.
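As an illustration of the definition of $\Delta$, the following sketch computes $x^*$ and the gap for a small finite action set. The function name `gap`, the tuple representation of actions, and the example instance are assumptions made for this sketch; it also assumes that the feasible optimum is unique.

```python
def gap(theta, A, alpha, actions):
    """Return (x_star, Delta) for a finite action set.

    theta: reward vector; A: list of constraint rows; alpha: constraint levels;
    actions: list of tuples. Assumes a unique feasible optimum.
    """
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    feasible = [x for x in actions
                if all(dot(a, x) <= al for a, al in zip(A, alpha))]
    x_star = max(feasible, key=lambda x: dot(theta, x))

    def badness(x):
        reward_gap = dot(theta, x_star) - dot(theta, x)
        violation = max(max(dot(a, x) - al for a, al in zip(A, alpha)), 0.0)
        return max(reward_gap, violation)

    delta = min(badness(x) for x in actions if x != x_star)
    return x_star, delta


# Hypothetical instance: four actions in the plane, one constraint x1 + x2 <= 1.
x_star, delta = gap((1, 2), [(1, 1)], [1], [(0, 0), (1, 0), (0, 1), (1, 1)])
print(x_star, delta)
```

Here $(1,1)$ has a higher reward than $x^* = (0,1)$ but violates the constraint by $1$, so it still contributes a positive term to the minimum.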

The result relies on the following observation, which follows straightforwardly from Lemma 3.2.

Lemma E.4.

If the confidence sets are consistent, and the modified finite-action version of doss chooses $x_t \neq x^*$ from $\mathcal{A},$ then $\rho_t(x_t;\delta) \geq \Delta$.

Proof of Lemma E.4.

Notice that the basic result Lemma 3.2 remains valid in this setting. As a result, if the confidence sets are consistent, then since $x_t$ is permissible, there exists $\tilde{A} \in \boldsymbol{\mathcal{C}}_t$ such that $\tilde{A}x_t \leq \alpha,$ and some $\tilde{\theta} \in \mathcal{C}_t^{\theta} : \langle\tilde{\theta}, x_t\rangle \geq \langle\theta, x^*\rangle$. Further, either there exists $i : \langle a^i, x_t\rangle \geq \alpha^i + \Delta$ or $\langle\theta, x_t\rangle \leq \langle\theta, x^*\rangle - \Delta$.
But by consistency, $\langle\tilde{a}^i, x_t\rangle \geq \langle a^i, x_t\rangle - \rho_t(x_t;\delta)$ and $\langle\tilde{\theta}, x_t\rangle \leq \langle\theta, x_t\rangle + \rho_t(x_t;\delta),$ so either case implies $\rho_t(x_t;\delta) \geq \Delta,$ which thus must hold. ∎

Proof of Proposition 7.7.

The claim can be shown using Lemma E.4 along the lines of the proof of Theorem 7.5. ∎

Appendix F Proofs of Lower Bounds

We conclude by showing the lower bounds claimed in the main text.

F.1 Proof of Polynomial Lower Bound

We argue Theorem 5.1 by fleshing out the example developed in § 5. The proof uses techniques that are largely standard in the bandit literature (Lattimore and Szepesvári, 2020, Ch. 24).

Proof of Theorem 5.1.

The instance we consider is

$$\mathcal{X} = [0,1],\ \theta^* = 1,\ a^1 = (1 \pm \kappa)/2,\ \alpha^1 = 1/4,\ w_t^i \stackrel{i.i.d.}{\sim} \mathcal{N}(0,1),\ i \in \{0,1\}$$

for some $\kappa \in (0,1/4)$. Note that implicitly, the above has the known constraints $-x \leq 0$ and $x \leq 1$. Of course, this one-dimensional construction can be embedded into an arbitrary dimension (for instance, by taking a very skinny box domain, and only enforcing this single unknown constraint).

In the above case, the optimal feasible solutions are $x^+ = \frac{1}{2(1+\kappa)},\ x^- = \frac{1}{2(1-\kappa)}$ for these two instances respectively. In addition, both of these instances are at least $1/8$-well separated. The key observation is the indistinguishability of these two instances with $\ll 1/\kappa^2$ actions.

Indeed, let $\mathbb{P}^+, \mathbb{P}^-$ be the distributions induced by the two problem instances and the learning algorithm. Since in either case, the noise distribution is standard Gaussian, and the reward distributions are identical, it follows that

$$D(\mathbb{P}^+(r_t, s_t^1)\,\|\,\mathbb{P}^-(r_t, s_t^1) \mid x_t = x) = \frac{(\kappa x)^2}{2} \leq \frac{\kappa^2}{2},$$

where we have used standard results about the KL-divergence between two Gaussians. Further, since actions must be causal, and since the noise is independent, we conclude that over the whole trajectory,

$$D(\mathbb{P}^+(\mathcal{H}_T)\,\|\,\mathbb{P}^-(\mathcal{H}_T)) \leq \frac{T\kappa^2}{2}.$$
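The per-round divergence comes from the standard fact that two unit-variance Gaussians whose means differ by $m$ have KL divergence $m^2/2$; here the safety means differ by $\kappa x$ and the reward laws coincide. A minimal sketch of this computation follows, with hypothetical function names and sample values chosen only for illustration.

```python
def per_round_kl(kappa, x):
    """KL between N(a_plus * x, 1) and N(a_minus * x, 1); rewards contribute 0."""
    a_plus, a_minus = (1 + kappa) / 2, (1 - kappa) / 2
    mean_gap = (a_plus - a_minus) * x  # equals kappa * x
    return mean_gap ** 2 / 2


def trajectory_kl_bound(kappa, T):
    """Bound T * kappa^2 / 2 on the divergence over T rounds."""
    return T * kappa ** 2 / 2


print(per_round_kl(0.1, 0.5), trajectory_kl_bound(0.1, 100))
```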

Let $x^{\mathsf{av}} := (x^+ + x^-)/2 = \frac{1}{2(1-\kappa^2)}$. Observe that

  • if the ground truth is $a^1 = (1+\kappa)/2$ and $x_t \geq x^{\mathsf{av}}$, then the algorithm incurs an instantaneous safety violation of at least $(1+\kappa)/2 \cdot x^{\mathsf{av}} - 1/4 = \frac{1+\kappa}{2} \cdot \frac{1}{2(1-\kappa^2)} - \frac{1}{4} = \frac{\kappa}{4(1-\kappa)} \geq \frac{\kappa}{4}$;

  • if the ground truth is $a^1 = (1-\kappa)/2$ and $x_t < x^{\mathsf{av}}$, then the algorithm incurs an instantaneous efficacy regret of at least $\frac{1}{2(1-\kappa)} - \frac{1}{2(1-\kappa^2)} \geq \frac{\kappa}{2}$.
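The two bullet points above can be checked with exact rational arithmetic. The sketch below evaluates $x^+$, $x^-$, $x^{\mathsf{av}}$ and the two per-round costs for a sample value of $\kappa$. The function name and the choice $\kappa = 1/10$ are illustrative assumptions.

```python
from fractions import Fraction


def check_instance(kappa):
    """Exact values of x_av and the two per-round costs in the construction."""
    half, quarter = Fraction(1, 2), Fraction(1, 4)
    x_plus = half / (1 + kappa)    # optimum when a^1 = (1 + kappa)/2
    x_minus = half / (1 - kappa)   # optimum when a^1 = (1 - kappa)/2
    x_av = (x_plus + x_minus) / 2
    # Safety violation of playing x_av when a^1 = (1 + kappa)/2.
    violation = (1 + kappa) / 2 * x_av - quarter
    # Efficacy regret of playing x_av when a^1 = (1 - kappa)/2 (theta = 1).
    regret = x_minus - x_av
    return x_av, violation, regret


kappa = Fraction(1, 10)
x_av, violation, regret = check_instance(kappa)
print(x_av, violation, regret)
```

For $\kappa = 1/10$ this gives $x^{\mathsf{av}} = 50/99$, a violation of $1/36 \geq \kappa/4$, and a regret of $5/99 \geq \kappa/2$.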

Let $\mathsf{A}$ be the event $\{\#\{t : x_t \geq x^{\mathsf{av}}\} \geq T/2\}.$ Using the Bretagnolle-Huber inequality (Lattimore and Szepesvári, 2020, Thm. 14.2),

$$\mathbb{P}^+(\mathsf{A}) + \mathbb{P}^-(\mathsf{A}^c) \geq \frac{1}{2}\exp\left(-D(\mathbb{P}^+(\mathcal{H}_T)\,\|\,\mathbb{P}^-(\mathcal{H}_T))\right) \geq \frac{1}{2}\exp(-T\kappa^2/2).$$

Let $\mathscr{E}_T^-$ denote the efficacy regret incurred by the learner under $\mathbb{P}^-$ and $\mathscr{S}_T^+$ denote the safety violation incurred by the learner under $\mathbb{P}^+$. Under the event $\mathsf{A},$ if the true $a$ was $(1+\kappa)/2,$ at least $T/2$ rounds incurred a safety violation of at least $\kappa/4,$ and so $\mathscr{S}_T^+ \geq \kappa T/8$. Similarly, under $\mathsf{A}^c,$ if the true $a$ was $(1-\kappa)/2,$ at least $T/2$ rounds had $x_t < x^{\mathsf{av}},$ each incurring an efficacy regret of at least $\kappa/2,$ implying that $\mathscr{E}_T^- \geq T\kappa/8$.

But this implies that

$$\max(\mathbb{E}^-(\mathscr{E}_T^-), \mathbb{E}^+(\mathscr{S}_T^+)) \geq \frac{T\kappa}{8}\max(\mathbb{P}^+(\mathsf{A}), \mathbb{P}^-(\mathsf{A}^c)) \geq \frac{T\kappa}{32}\exp(-T\kappa^2/2).$$

For $T \geq 16,$ we may choose $\kappa = 1/\sqrt{T} < 1/4$ to conclude that in at least one instance, the safety or efficacy regret incurred must be at least $\sqrt{T}/(32e^{1/2}) \geq \sqrt{T}/64.$
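The final substitution can be verified numerically: with $\kappa = 1/\sqrt{T}$ the right-hand side $\frac{T\kappa}{32}\exp(-T\kappa^2/2)$ equals $\sqrt{T}/(32\sqrt{e})$, which exceeds $\sqrt{T}/64$ because $\sqrt{e} < 2$. A short check follows; the function name and the sampled horizons are illustrative assumptions.

```python
import math


def lower_bound(T, kappa):
    """Right-hand side T*kappa/32 * exp(-T*kappa^2/2) from the proof."""
    return T * kappa / 32 * math.exp(-T * kappa ** 2 / 2)


results = {}
for T in (100, 10_000, 1_000_000):
    kappa = 1 / math.sqrt(T)
    results[T] = lower_bound(T, kappa)
    # With kappa = 1/sqrt(T) the bound is sqrt(T) / (32 * sqrt(e)) >= sqrt(T) / 64.
    print(T, results[T] >= math.sqrt(T) / 64)
```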

F.2 Necessity of Dependence on Gaps.

We conclude the theoretical part of this paper by showing Theorem 7.4, via a reduction to prior lower bounds on the safe multi-armed bandit problem (Chen et al., 2022).

The safe MAB problem is parametrised by $d$ arms with mean rewards $\mu_k$ and mean safety risks $\nu_k$ each. The optimal arm, $k^*$, has reward $\mu_*$ and safety risk $\nu_* < \alpha$. The associated efficacy and safety gaps are $\Delta_k := (\mu_* - \mu_k)_+$ and $\Gamma_k := (\nu_k - \alpha)_+.$ In each round, the learner is required to select one arm, and observes bounded signals with the above means for both the rewards and safety. Implicitly, this can be thought of as a linear bandit setting, with the known constraints being that $x$ lies in a simplex, the reward vector $\theta,$ and the constraint vector $a$. This reduction, however, is not completely correct: in the safe MAB problem, the actions are required to lie entirely on the corner points of the simplex, and we are not allowed to play in the interior. While it is standard to view $x$ as a probability of selecting each arm in a MAB instance, this reduction fails due to the nonlinearity in our metrics. Indeed, the safe MAB problem considers the metrics

$$\mathscr{E}_T^{\mathsf{MAB}} := \sum (\mu_* - \mu_{A_t})_+, \quad \mathscr{S}_T^{\mathsf{MAB}} := \sum (\nu_{A_t} - \alpha)_+.$$

As a result, if the optimum of the SLB problem lies away from the corner points of the simplex, then the SLB problem can incur low regret, while the corresponding MAB actions would incur linear regret. Nevertheless, we shall argue below that for carefully designed instances, a low regret in the linear bandit problem does ensure nontrivial regret in the safe MAB problem.

The main result we shall use is the following, which is a mild variation of Proposition 6 of Chen et al. (2022), and can be shown using their proof.

Lemma F.1.

Let $f : \mathbb{N} \to [0,\infty)$ be any fixed function such that $f(T) \leq T$ for all $T$. If an algorithm ensures that for every safe MAB instance, suboptimal arms are not played more than $f(T)$ times in expectation, then for every $\theta, a,$ there exists a choice of arm distributions for the safe MAB instance for which the means are as described, and the number of times each suboptimal arm $k$ is played is lower bounded in expectation as

$$\mathbb{E}[N_T^k] \geq \frac{1}{(d(\mu_k\|\mu_*)\mathds{1}\{\mu_k < \mu_*\} + d(\nu_k\|\alpha)\mathds{1}\{\nu_k > \alpha\})} \cdot \left((1 - f(T)/T)\log\frac{T}{f(T)} - \log(2)\right),$$

where $d(u\|v)$ is the KL divergence between Bernoulli laws with means $u$ and $v$. In particular, these distributions are simply Bernoulli laws with the above means.
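To get a feel for the size of the bound in Lemma F.1, the following sketch evaluates its right-hand side for a single suboptimal arm. The function names, the argument layout, and the sample means are assumptions made for illustration; the arm is assumed to be genuinely suboptimal so that the denominator is positive.

```python
import math


def bernoulli_kl(u, v):
    """KL divergence d(u || v) between Bernoulli(u) and Bernoulli(v), 0 < v < 1."""
    total = 0.0
    if u > 0:
        total += u * math.log(u / v)
    if u < 1:
        total += (1 - u) * math.log((1 - u) / (1 - v))
    return total


def pull_lower_bound(mu_k, nu_k, mu_star, alpha, T, f_T):
    """Right-hand side of the bound on E[N_T^k] for a suboptimal arm k."""
    denom = 0.0
    if mu_k < mu_star:
        denom += bernoulli_kl(mu_k, mu_star)
    if nu_k > alpha:
        denom += bernoulli_kl(nu_k, alpha)
    return ((1 - f_T / T) * math.log(T / f_T) - math.log(2)) / denom


# Hypothetical arm: safe (nu_k <= alpha) but ineffective (mu_k < mu_star).
print(pull_lower_bound(0.4, 0.3, 0.5, 0.5, T=10**6, f_T=1000))
```

As expected, moving `mu_k` closer to `mu_star` shrinks the divergence in the denominator and so increases the number of pulls the bound requires.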

Our argument for the linear bandit proceeds thus. We shall carefully design a safe linear bandit instance for which we essentially provide multi-armed bandit feedback by using the standard reduction that each coordinate of $x_t$ represents the probability of pulling the corresponding arm. We shall show that in the selected instance, achieving low linear regret ensures that the MAB regret is controlled (although to a weaker level). Then exploiting the above lower bound, we shall argue that the regret of the safe linear bandit cannot be too good, since it would violate the above lower bound.

Proof of Theorem 7.4.

We first carefully describe our main constructions for the SLB and MAB, form a crude bound that allows us to use Lemma F.1, and then refine the analysis to show effective lower bounds on the SLB regret.

SLB Instance.

We work with $d = 2$ with a single unknown constraint. Let $\theta = (\theta_1, \theta_2)$ and $a^1 = (\alpha, a^1_2)$ be vectors in $[0,1]^2$ such that $\theta_2 > \theta_1 > 0,\ a^1_2 > \alpha > 0$ and $\theta_2\alpha < \theta_1 a^1_2.$ The safe bandit instance we design is

$$\max \langle\theta, x\rangle : x_1 \geq 0,\ x_2 \geq 0,\ x_1 + x_2 \leq 1,\ \langle a, x\rangle \leq \alpha,$$

where the last constraint is unknown and the rest are known. Let us call the three known constraints $a^2,a^3,a^4$. There are 6 BISs, with the associated points and gaps shown in Table 1 below. The only point meeting the constraints $a^3$ and $a^4$ is $(0,1)$, which is infeasible. Note that the situation is highly degenerate, since the optimal point is $x^*=(1,0)$ and three distinct BISs activate it. Nevertheless, each of these BISs is full rank. Further, since the algorithm ensures that $\mathscr{S}_T$ and $\mathscr{E}_T$ are both $O(\sqrt{T})$ in general, our discussion below is effective.

Table 1: Description of BISs in our construction.

| BIS | Activating Point | $\zeta_*(I)$ | $\eta_*(I)$ |
|---|---|---|---|
| $\{1,2\}$ | $(0,\alpha/a^1_2)$ | $0$ | $(\theta_1 a^1_2-\alpha\theta_2)/(a^1_2+\theta_2)$ |
| $\{1,3\}$ | $(1,0)$ | $0$ | $0$ |
| $\{1,4\}$ | $(1,0)$ | $0$ | $0$ |
| $\{2,3\}$ | $(0,0)$ | $0$ | $\theta_1$ |
| $\{2,4\}$ | $(1,0)$ | $0$ | $0$ |
| $\{3,4\}$ | $\emptyset$ | $a^1_2-\alpha$ | $0$ |

The gap of this instance is

$$\Gamma:=\min\left(\theta_1,\ a^1_2-\alpha,\ \frac{\theta_1 a^1_2-\alpha\theta_2}{a^1_2+\theta_2}\right).$$

Our construction requires that this is at least $\Omega(\min(\theta_1,a^1_2-\alpha))$. This can always be ensured, for example, by using the parameterisation $\theta_2=2\theta_1$, $a^1_2=4\theta_1$, $\alpha=\theta_1/2$, whence the expressions work out to

$$a^1_2-\alpha=7\theta_1/2,\qquad \frac{\theta_1 a^1_2-\alpha\theta_2}{a^1_2+\theta_2}=3\theta_1/5,$$

giving us $\Gamma\geq\theta_1/2$. We further impose the condition $4\theta_1<1/4$. Thus, this instance lets us express every value of $\Gamma<1/32$.
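The instance above can be sanity-checked by brute force. The following is a minimal sketch, not part of the original argument: it enumerates the feasible intersection points of pairs of constraint boundaries in exact rational arithmetic and confirms that the optimum sits at $(1,0)$ with value $\theta_1$. The helper name `vertices` and the value $\theta_1 = 1/25$ are illustrative assumptions; any $\theta_1 < 1/16$ under the parameterisation $\theta_2=2\theta_1$, $a^1_2=4\theta_1$, $\alpha=\theta_1/2$ would serve.

```python
from fractions import Fraction
from itertools import combinations


def vertices(constraints):
    """Feasible intersections of pairs of boundaries of c1*x1 + c2*x2 <= b."""
    pts = set()
    for (p1, p2, b), (q1, q2, e) in combinations(constraints, 2):
        det = p1 * q2 - p2 * q1
        if det == 0:
            continue  # parallel boundaries never meet
        # Cramer's rule for the 2x2 system p.x = b, q.x = e
        x1 = (b * q2 - p2 * e) / det
        x2 = (p1 * e - b * q1) / det
        if all(c1 * x1 + c2 * x2 <= rhs for c1, c2, rhs in constraints):
            pts.add((x1, x2))
    return pts


# Illustrative value (an assumption); the text only requires theta_1 < 1/16.
theta1 = Fraction(1, 25)
theta = (theta1, 2 * theta1)
a12, alpha = 4 * theta1, theta1 / 2

zero, one = Fraction(0), Fraction(1)
constraints = [
    (-one, zero, zero),    # x1 >= 0
    (zero, -one, zero),    # x2 >= 0
    (one, one, one),       # x1 + x2 <= 1
    (alpha, a12, alpha),   # <a^1, x> <= alpha (the unknown constraint)
]

verts = vertices(constraints)
best = max(verts, key=lambda x: theta[0] * x[0] + theta[1] * x[1])
best_value = theta[0] * best[0] + theta[1] * best[1]
print(sorted(verts), best, best_value)
```

Exact fractions are used so that the degenerate vertex $(1,0)$, which several pairs of constraints produce, is deduplicated without any floating-point tolerance.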

Safe MAB Instance.

Let us now describe the associated MAB instance. We work with three arms of means $\mu=(1/2+\theta_1,\ 1/2+\theta_2,\ 1/2)$ and risks $(1/2+\alpha,\ 1/2+a^1_2,\ 1/2)$. In each case, the underlying laws are Bernoullis with the associated mean, all taken to be independent, which forms the family of instances that underlie Lemma F.1. The connection to the linear bandit instance is as follows: each time we pick $(x_1,x_2)$, we sample a random variable in $\{1,2,3\}$ according to the pmf $(x_1,x_2,1-x_1-x_2)$, pull the corresponding arm, and then supply the resulting rewards and risks, with $1/2$ subtracted, to the linear bandit instance. Note that this is an unbiased measurement of the mean for the linear bandit, since

$$\mathbb{E}[R]=x_1\cdot(1/2+\theta_1)+x_2\cdot(1/2+\theta_2)+(1-x_1-x_2)\cdot(1/2)-1/2=x_1\theta_1+x_2\theta_2$$

and similarly for the safety risk. These offsets of $1/2$ are added because then the KL divergences appearing in the bound of Lemma F.1 take the form $d(1/2\,\|\,1/2+\theta_1)$ and $d(1/2+a^1_2\,\|\,1/2+\alpha)$, and the arguments are bounded away from $0$ and $1$, ensuring that the behaviour for small $\theta_1$ is quadratic rather than the potentially worse dependence near $0$ and $1$. To ensure this good behaviour, we use that $a^1_2<1/4$, due to which $a^1_2+1/2<3/4$ is bounded away from $1$, which is the origin of our condition $\theta_1<1/16$ in the previous paragraph.
The key observation is that in the safe MAB instance, $\mathbb{E}[N_T^2]=\sum x_{t,2}$ and $\mathbb{E}[N_T^3]=\sum(1-x_{t,1}-x_{t,2})$, where $x_{t,k}$ is the $k$th component of $x_t$.
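The reduction just described can be simulated directly. The sketch below is illustrative only: the function names `pull` and `estimate`, the seed, the sample size, the action $x=(0.6,0.3)$ and the value $\theta_1=0.05$ are all assumptions made for the example. It samples an arm from the pmf $(x_1,x_2,1-x_1-x_2)$, draws Bernoulli reward and risk, subtracts $1/2$, and compares the empirical averages with $\langle\theta,x\rangle$ and $\langle a^1,x\rangle$.

```python
import random


def pull(x, theta, a, rng):
    """One round: sample an arm from (x1, x2, 1-x1-x2), return shifted feedback."""
    means = (0.5 + theta[0], 0.5 + theta[1], 0.5)
    risks = (0.5 + a[0], 0.5 + a[1], 0.5)
    weights = (x[0], x[1], 1 - x[0] - x[1])
    arm = rng.choices((0, 1, 2), weights=weights)[0]
    reward = 1.0 if rng.random() < means[arm] else 0.0
    risk = 1.0 if rng.random() < risks[arm] else 0.0
    # The linear bandit sees the Bernoulli draws with 1/2 subtracted.
    return reward - 0.5, risk - 0.5, arm


def estimate(x, theta, a, n, seed=0):
    """Average shifted reward and risk, and empirical arm frequencies."""
    rng = random.Random(seed)
    total_reward = total_risk = 0.0
    counts = [0, 0, 0]
    for _ in range(n):
        r, s, arm = pull(x, theta, a, rng)
        total_reward += r
        total_risk += s
        counts[arm] += 1
    return total_reward / n, total_risk / n, [c / n for c in counts]


# Illustrative parameters (assumptions): theta_1 = 0.05 and the text's scaling.
theta1 = 0.05
theta = (theta1, 2 * theta1)
a = (theta1 / 2, 4 * theta1)
x = (0.6, 0.3)

avg_reward, avg_risk, freqs = estimate(x, theta, a, n=200_000)
print(avg_reward, theta[0] * x[0] + theta[1] * x[1])
print(avg_risk, a[0] * x[0] + a[1] * x[1])
print(freqs)
```

The arm frequencies returned by `estimate` are the empirical counterparts of the identity $\mathbb{E}[N_T^2]=\sum x_{t,2}$ when the same action is played in every round.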

Crude Bound.

We first show that as long as the algorithm ensures that $\max(\mathscr{E}_T,\mathscr{S}_T)=O(T^{1-c})$, the play of suboptimal arms in the MAB instance is at least $\Omega(\theta_1^{-2}\log T)$. Fix $\theta_1$ and the above models. Suppose that the safe linear bandit ensures that $\mathscr{E}_T\leq g(T)$ and $\mathscr{S}_T\leq g(T)$ for every instance, where $g(T)\leq T$ is an arbitrary monotonic function. Let $\zeta>0$ be a parameter that we will fix later. Then observe that if the linear bandit instance ever plays a point $(x_1,x_2)$ such that

$$\langle a^1,x\rangle\geq\alpha+\zeta\quad\textrm{or}\quad\langle\theta,x\rangle\leq\theta_1-\zeta,$$

then it would incur a pointwise cost of at least $\zeta$ in the round, for either $\mathscr{E}_T$ or $\mathscr{S}_T$. This means that the number of rounds in which it plays such points is bounded by $g(T)/\zeta$. So, in at least $\max(T-g(T)/\zeta,0)$ rounds, the safe linear bandit instance plays in the region

$$P_\zeta:=\{\langle a^1,x\rangle\leq\alpha+\zeta,\ \langle\theta,x\rangle\geq\theta_1-\zeta,\ x_1\geq 0,\ x_2\geq 0,\ x_1+x_2\leq 1\}.$$

Now notice that both $x_2$ and $1-x_1-x_2$ are upper bounded in this region. Indeed, the corner points of this region are

$$\left(1-\frac{\zeta}{\theta_1},\,0\right),\quad \left(1-\frac{\zeta}{a^1_2-\alpha},\,\frac{\zeta}{a^1_2-\alpha}\right),\quad \left(1-\frac{\zeta}{\theta_1}\left\{1+\frac{\theta_2(\theta_1+\alpha)}{\theta_1(\theta_1 a^1_2-\alpha\theta_2)}\right\},\,\frac{\theta_1+\alpha}{\theta_1 a^1_2-\alpha\theta_2}\frac{\zeta}{\theta_1}\right),\quad (1,0),$$

and so ensuring that $a^1_2-\alpha,\ \theta_1 a^1_2-\alpha\theta_2=\Omega(\theta_1)$, we have

$$x\in P_\zeta\implies x_2\leq\zeta/\theta_1,\quad (1-x_1-x_2)=O(\zeta/\theta_1).$$
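The two bounds on $P_\zeta$ can be checked numerically for a concrete instance. The following sketch makes several assumptions for illustration: it uses the parameterisation $\theta_2=2\theta_1$, $a^1_2=4\theta_1$, $\alpha=\theta_1/2$ with $\theta_1=1/25$ and $\zeta=1/1000$, and a hypothetical helper `vertices`. Since a linear function over a bounded polytope is maximised at a vertex, it suffices to enumerate the corner points of $P_\zeta$ in exact arithmetic and take the largest value of $x_2$ and of $1-x_1-x_2$.

```python
from fractions import Fraction
from itertools import combinations


def vertices(constraints):
    """Feasible intersections of pairs of boundaries of c1*x1 + c2*x2 <= b."""
    pts = set()
    for (p1, p2, b), (q1, q2, e) in combinations(constraints, 2):
        det = p1 * q2 - p2 * q1
        if det == 0:
            continue  # parallel boundaries never meet
        x1 = (b * q2 - p2 * e) / det
        x2 = (p1 * e - b * q1) / det
        if all(c1 * x1 + c2 * x2 <= rhs for c1, c2, rhs in constraints):
            pts.add((x1, x2))
    return pts


# Illustrative values (assumptions), with zeta much smaller than theta_1.
theta1 = Fraction(1, 25)
zeta = Fraction(1, 1000)
theta = (theta1, 2 * theta1)
a1 = (theta1 / 2, 4 * theta1)
alpha = theta1 / 2

zero, one = Fraction(0), Fraction(1)
p_zeta = [
    (a1[0], a1[1], alpha + zeta),               # <a^1, x> <= alpha + zeta
    (-theta[0], -theta[1], -(theta1 - zeta)),   # <theta, x> >= theta_1 - zeta
    (-one, zero, zero),                         # x1 >= 0
    (zero, -one, zero),                         # x2 >= 0
    (one, one, one),                            # x1 + x2 <= 1
]

corners = vertices(p_zeta)
max_x2 = max(x2 for _, x2 in corners)
max_slack = max(1 - x1 - x2 for x1, x2 in corners)
print(sorted(corners))
print(max_x2, max_slack, zeta / theta1)
```

For this instance both maxima are within a small constant multiple of $\zeta/\theta_1$, in line with the implication displayed above.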

Of course, outside of $P_\zeta$, $x_2\leq 1$ and $1-x_1-x_2\leq 1$. The calculation holds no matter the $\zeta$ we choose, so long as $\zeta\ll\theta_1$. This means that for every $\zeta=O(\theta_1)$,

$$\begin{aligned}
\mathbb{E}[N_T^2] &= \sum\mathbb{E}[x_{t,2}]\leq O(\zeta/\theta_1)T+\frac{g(T)}{\zeta}\\
\mathbb{E}[N_T^3] &= \sum\mathbb{E}[(1-x_{t,1}-x_{t,2})]\leq O(\zeta/\theta_1)T+\frac{g(T)}{\zeta},
\end{aligned}$$

i.e., we have shown that the safe MAB incurs regret bounds of at most $f(T)=O(\zeta T)+g(T)\theta_1/\zeta$ for both $\mathscr{E}_T$ and $\mathscr{S}_T$.

Since $g(T)\leq CT^{1-c}$ for some constants $C,c$, by taking $\zeta=T^{-c/2}$, for large enough $T$, we thus have the low-regret bound $\max(\mathbb{E}[\mathscr{E}_T^{\mathsf{MAB}}],\mathbb{E}[\mathscr{S}_T^{\mathsf{MAB}}])\leq CT^{1-c/2}$. But then, by Lemma F.1, it must follow that as $T\to\infty$,

$$\begin{aligned}
&\mathbb{E}[N_T^2]\geq\frac{1}{d(1/2+4\theta_1\,\|\,1/2+\theta_1/2)}\left((1-o(1))\frac{c}{2}\log T-O(1)\right)=\Omega(\theta_1^{-2}\log T),\\
\textrm{or }&\mathbb{E}[N_T^3]\geq\frac{1}{d(1/2\,\|\,1/2+\theta_1)}\left((1-o(1))\frac{c}{2}\log T-O(1)\right)=\Omega(\theta_1^{-2}\log T)
\end{aligned}$$

To use these bounds effectively, we employ a computer algebra system to argue that$^{8}$

$$\forall\theta_1\leq 1/16,\quad d(1/2+4\theta_1\,\|\,1/2+\theta_1/2)\leq 27\theta_1^2,\quad d(1/2\,\|\,1/2+\theta_1)\leq 27\theta_1^2.$$

$^{8}$ Observe that since the divergences considered are minimised to $0$ at $\theta_1=0$, the local behaviour for small $\theta_1$ is quadratic. Further, the function is smooth in $\theta_1$. Thus, for large enough $K$, there exists an interval $[0,\theta_1(K)]$ such that for any $x$ in this region, $d(1/2+ux\,\|\,1/2+vx)\leq Kx^2$. We simply plugged in various constants for $K$ until we found that $\theta_1(27)\geq 1/16$.
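The two divergence bounds above can also be checked numerically without a computer algebra system. The sketch below is a grid check rather than a proof, and the helper names `kl_bernoulli` and `worst_ratios` and the grid size are assumptions made for illustration. It evaluates the Bernoulli KL divergence $d(p\,\|\,q)=p\log(p/q)+(1-p)\log((1-p)/(1-q))$ on a fine grid of $\theta_1\in(0,1/16]$ and reports the largest observed value of each divergence divided by $\theta_1^2$.

```python
import math


def kl_bernoulli(p, q):
    """KL divergence d(p || q) between Bernoulli(p) and Bernoulli(q), 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def worst_ratios(n_grid=2000, theta_max=1 / 16):
    """Largest d(.||.) / theta^2 over a uniform grid of theta in (0, theta_max]."""
    worst_risk = worst_reward = 0.0
    for k in range(1, n_grid + 1):
        t = theta_max * k / n_grid
        # Risk-side divergence: arm 2 risk 1/2 + 4t against threshold 1/2 + t/2.
        worst_risk = max(worst_risk, kl_bernoulli(0.5 + 4 * t, 0.5 + t / 2) / t**2)
        # Reward-side divergence: arm 3 mean 1/2 against arm 1 mean 1/2 + t.
        worst_reward = max(worst_reward, kl_bernoulli(0.5, 0.5 + t) / t**2)
    return worst_risk, worst_reward


worst_risk, worst_reward = worst_ratios()
print(worst_risk, worst_reward)
```

Both reported ratios should stay below the constant $27$ used in the text; the risk-side divergence is the binding one, since its two arguments differ by $7\theta_1/2$ rather than $\theta_1$.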

Concretely, the bounds above then yield

$$\mathbb{E}[N_T^2+N_T^3]\geq\frac{c}{27\theta_1^2}\left((1-o(1))\log T-\frac{1}{c}\log(4)\right),$$

where the $o(1)$ term is $C/T^{c/2}$.

Note again that this bound is effective for our case since the method doss does achieve $\max(\mathscr{E}_T,\mathscr{S}_T)=\tilde{O}(\sqrt{T})$ with high probability, in which case we can set $c=1/2-\gamma$ for any $\gamma>0$ in the above.

Lower Bounds on SLB.

Let us now come to showing the claims. We select the instance $\theta_2=2\theta_1$, $a^1_2=4\theta_1$, $\alpha=\theta_1/2$. Notice that in this case, the gaps are $(\theta_1,\ 7\theta_1/2,\ 3\theta_1/5)$, and so $\Gamma\geq\theta_1/2$. Further, $\theta_1 a^1_2-\alpha\theta_2=3\theta_1$, and so the claim on $P_\zeta$ above remains valid for all $\zeta\leq\theta_1$. Thus, against this instance, the above lower bounds on $\mathbb{E}[N_T^2]+\mathbb{E}[N_T^3]$ hold.

But, observe that for any choice of $x_1,x_2$, it holds that the instantaneous efficacy regret and safety violations are

$$\begin{aligned}
(\theta_1-\theta_1x_1-\theta_2x_2)_+ &= (\theta_1(1-x_1-x_2)-(\theta_2-\theta_1)x_2)_+=\theta_1((1-x_1-x_2)-x_2)_+\\
(\alpha x_1+a^1_2x_2-\alpha)_+ &= ((a^1_2-\alpha)x_2-\alpha(1-x_1-x_2))_+=\frac{\theta_1}{2}(7x_2-(1-x_1-x_2))_+
\end{aligned}$$
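The two identities above can be verified in exact arithmetic over a grid of points in the simplex. This is a minimal sketch under stated assumptions: the value $\theta_1=1/25$, the grid resolution, and the helper names `pos` and `check_identities` are chosen for illustration. It also checks the elementary consequence, obtained by dropping the positive parts, that the sum of the two instantaneous costs is at least $\frac{\theta_1}{2}\left(5x_2+(1-x_1-x_2)\right)$.

```python
from fractions import Fraction


def pos(z):
    """Positive part (z)_+."""
    return max(z, 0)


# Illustrative value (an assumption) with the parameterisation from the text.
theta1 = Fraction(1, 25)
theta2, a12, alpha = 2 * theta1, 4 * theta1, theta1 / 2


def check_identities(n):
    """Check both identities on the grid {(i/n, j/n) : i + j <= n}; return #points."""
    checked = 0
    for i in range(n + 1):
        for j in range(n + 1 - i):
            x1, x2 = Fraction(i, n), Fraction(j, n)
            slack = 1 - x1 - x2
            regret = pos(theta1 - theta1 * x1 - theta2 * x2)
            violation = pos(alpha * x1 + a12 * x2 - alpha)
            # Closed forms in terms of x2 and the slack 1 - x1 - x2.
            assert regret == theta1 * pos(slack - x2)
            assert violation == theta1 / 2 * pos(7 * x2 - slack)
            # Since (z)_+ >= z, the summed costs dominate the linear expression.
            assert regret + violation >= theta1 / 2 * (5 * x2 + slack)
            checked += 1
    return checked


num_points = check_identities(20)
print(num_points)
```

Because all quantities are rationals, the equalities are tested exactly rather than up to a floating-point tolerance.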

But notice that the only way both of these quantities are $0$ is if $x_2\geq(1-x_1-x_2)\geq 7x_2\implies x_2=1-x_1-x_2=0\iff x_1=1$. So, in any round such that $x_1\neq 1$, at least one of these quantities is nonzero. More quantitatively, we have

\begin{align*}
\mathbb{E}[\mathscr{E}_T] + \mathbb{E}[\mathscr{S}_T] &\geq \sum \mathbb{E}\bigl[(\theta_1 - \theta_1 x_{t,1} - \theta_2 x_{t,2}) + (\alpha x_{t,1} + a^1_2 x_{t,2} - \alpha)\bigr] \\
&\geq \theta_1 \sum \Bigl(\frac{5}{2}\mathbb{E}[x_{t,2}] + \frac{1}{2}\mathbb{E}[(1 - x_{t,1} - x_{t,2})]\Bigr) \\
&\geq \frac{\theta_1}{2}\bigl(\mathbb{E}[N_T^2] + \mathbb{E}[N_T^3]\bigr) \geq \frac{c(1 - o(1))}{54\theta_1}\log(T) - O(1),
\end{align*}

which yields the result upon recalling that $\theta_1 \geq \Gamma \geq \theta_1/2$.

Appendix G Alternative Safety Metrics

We briefly investigate the behaviour of alternative safety metrics of the form

\[
\mathscr{S}_T^f := \sum_{t \leq T} f\bigl(\max_i (\langle a^i, x_t\rangle - \alpha^i)_+\bigr),
\]

where $f$ is some increasing $h$-Hölder continuous map such that $\lim_{x \searrow 0} f(x) = 0$. Note that this section should be read after the reader is familiar with our typical proof techniques.

Note that due to our assumption that $\|a^i\| \leq 1, \|x\| \leq 1,$ it follows that $(\langle a^i, x_t\rangle - \alpha^i)_+ \leq 2$: since the problem is feasible, and $|\langle a^i, x^*\rangle| \leq 1,$ it follows that $-\mathbf{1} \leq Ax^* \leq \alpha,$ and so $\langle a^i, x\rangle - \alpha^i \leq 1 - (-1)$. Thus, only the behaviour of $f$ over $[0,2]$ matters.

Now, since $f$ is Hölder continuous, and $f(0^+) = 0,$ near $0$ it satisfies $f(x) \leq Cx^h$. In this case, we may as well study the behaviour of $f_h := x \mapsto x^h,$ which we argue determines the lower and upper bounds. Before proceeding, note that we may uniformly bound the noise scale by $2$, since we know that roundwise inefficacy or constraint violation can be at most $2$.

Now, for the penalty $\mathscr{S}_T^h = \sum_{t \leq T} (\max_i \langle a^i, x_t\rangle - \alpha^i)_+^h,$ observe that if $h \geq 2,$ we can directly bound the behaviour of $\mathscr{S}_T^h$ using

\[
\mathscr{S}_T^h \leq \sum \rho_t^h(x_t;\delta) \leq 2^{h-2}\sum \rho_t^2(x_t;\delta) \leq 2^{h-2} \cdot O(d^2\log^2 T).
\]

Thus, the only interesting behaviour is when $h < 2$.
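As a quick numerical sanity check of the pointwise inequality behind the bound for $h \geq 2$, namely $\rho^h \leq 2^{h-2}\rho^2$ for $\rho \in [0,2]$ (which holds because $\rho^{h-2} \leq 2^{h-2}$ there), the following sketch evaluates both sides on a grid. The function name, the grid, and the tolerance are illustrative assumptions and are not part of the analysis.

```python
def power_bound_holds(rho_values, h, tol=1e-12):
    """Return True if rho**h <= 2**(h-2) * rho**2 for every rho in rho_values."""
    return all(r ** h <= 2.0 ** (h - 2) * r ** 2 + tol for r in rho_values)

# Illustrative grid over [0, 2], the range of roundwise violations.
rho_grid = [2.0 * i / 2000 for i in range(2001)]

# Check a few exponents h >= 2 (hypothetical choices).
results = {h: power_bound_holds(rho_grid, h) for h in (2.0, 2.5, 3.0, 5.0)}
print(results)
```

The inequality fails for $h < 2$ (for instance at $\rho = 1$, $h = 1$), which is why that regime needs a separate argument.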

For $h \in (0,2),$ applying Hölder’s inequality with $p = 2/h > 1,$ we have

\[
\mathscr{S}_T^h \leq \sum \rho_t^h(x_t;\delta) \leq \left(\sum_t (\rho_t^h)^{2/h}\right)^{h/2} \cdot \left(\sum 1^{2/(2-h)}\right)^{1-h/2} = T^{1-h/2} \cdot O(d^h\log^h T).
\]
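The Hölder step above can be checked numerically on a synthetic sequence. In this minimal sketch the values standing in for $\rho_t$ are hypothetical (random radii decaying like $1/\sqrt{t}$ and clipped to $[0,2]$); they are not produced by doss, and the function name is an assumption made for illustration.

```python
import random

def holder_sides(rho, h):
    """Return (sum_t rho_t^h, (sum_t rho_t^2)^(h/2) * T^(1 - h/2)) for 0 < h < 2."""
    T = len(rho)
    lhs = sum(r ** h for r in rho)
    rhs = sum(r ** 2 for r in rho) ** (h / 2) * T ** (1 - h / 2)
    return lhs, rhs

rng = random.Random(0)  # fixed seed for reproducibility
T = 1000
# Hypothetical stand-in for rho_t: decays like 1/sqrt(t), clipped to [0, 2].
rho = [min(2.0, rng.uniform(0.5, 1.5) / t ** 0.5) for t in range(1, T + 1)]

checks = {h: holder_sides(rho, h) for h in (0.5, 1.0, 1.5)}
for h, (lhs, rhs) in checks.items():
    print(h, lhs <= rhs)
```

Since $\sum_t \rho_t^2$ grows only polylogarithmically in the analysis, the right-hand side is dominated by the $T^{1-h/2}$ factor.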

We now note that, by modifying the analysis of §5, this rate of safety decay can be seen to be tight. Indeed, our construction in that section shows that for $t \leq 1/\kappa^2,$ one either incurs a roundwise inefficacy of $\kappa$ or a roundwise violation of $\kappa$. Accounting for the power-cost, we get a lower bound of the form

\[
\textit{either } \mathscr{E}_T \gtrsim \kappa \cdot \min(\kappa^{-2}, T) \textit{ or } \mathscr{S}_T^h \gtrsim \kappa^h \cdot \min(\kappa^{-2}, T).
\]

But again, taking $\kappa = T^{-1/2}$, we find that

\[
\textit{either } \mathscr{E}_T \geq \sqrt{T} \textit{ or } \mathscr{S}_T^h \geq T^{1-h/2}.
\]

Thus, up to polylog terms, the behaviour of doss remains tight with respect to $\mathscr{S}_T^h$, simultaneously for every $h > 0$.
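To spell out the choice $\kappa = T^{-1/2}$ in the lower bound above (a routine calculation, included only for convenience): at this value $\min(\kappa^{-2}, T) = T$, so that

\[
\kappa \cdot \min(\kappa^{-2}, T) = T^{-1/2} \cdot T = \sqrt{T}, \qquad \kappa^h \cdot \min(\kappa^{-2}, T) = T^{-h/2} \cdot T = T^{1-h/2}.
\]

For $h \in (0,2)$ this choice maximises both expressions at once: when $\kappa \leq T^{-1/2}$ they equal $\kappa T$ and $\kappa^h T$, which increase in $\kappa$, and when $\kappa \geq T^{-1/2}$ they equal $\kappa^{-1}$ and $\kappa^{h-2}$, which decrease in $\kappa$.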

Coming back to general smooth losses, we immediately note that the same analysis extends to any loss that is $h$-Hölder: using the bound $f(x) \leq Cx^h$,

\[
\mathscr{S}_T^f \leq C\sum_t \rho_t^h(x_t;\delta),
\]

and the bound follows. This extends to losses $f$ that are smooth in some interval near $0^+$ of the form $(0,k)$. For the upper bound, we may decompose the net violation as

\[
\mathscr{S}_T^f \leq \sum_t f(\rho_t)\mathbf{1}\{\rho_t > k\} + \sum_t f(\rho_t)\mathbf{1}\{\rho_t \leq k\}.
\]

The latter term can be dealt with as above, since $f(x) \leq Cx^h$ over $(0,k),$ while the former term can be bounded as

\[
\sum_t f(\rho_t)\mathbf{1}\{\rho_t > k\} \leq \Bigl(\max_{x \in [0,2]} f(x)\Bigr)\sum_t \rho_t^2/k^2 = O(k^{-2}d^2\log^2 T),
\]

leading to an additive polylogarithmic overhead beyond the main term. The lower bound also generalises: if on $(0,k)$, $f$ is $h$-Hölder but not $h'$-Hölder for any $h' > h,$ then there exists some interval $(0,k')$ over which $f/x^h$ remains both lower and upper bounded, and we can employ our lower bound for $\mathscr{S}_T^h$ for $T \gg 1/(k')^2.$
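The split of the violation at the threshold $k$ used in the upper bound above relies on the counting estimate $\mathbf{1}\{\rho_t > k\} \leq \rho_t^2/k^2$. The sketch below illustrates this on made-up data: the loss `f`, the threshold `k`, and the sequence standing in for $\rho_t$ are all hypothetical choices for illustration and are not taken from the analysis.

```python
import math

def split_violation(rho, f, k):
    """Split sum_t f(rho_t) into the parts with rho_t > k and with rho_t <= k."""
    large = sum(f(r) for r in rho if r > k)
    small = sum(f(r) for r in rho if r <= k)
    return large, small

def large_part_bound(rho, f_max, k):
    """Bound the rho_t > k part via 1{rho_t > k} <= rho_t^2 / k^2."""
    return f_max * sum(r * r for r in rho) / (k * k)

K = 0.25      # hypothetical threshold k
F_MAX = 3.0   # maximum of the hypothetical loss over [0, 2]

def f(x):
    """Hypothetical increasing loss: 1/2-Hölder on [0, K], constant beyond K."""
    return math.sqrt(x) if x <= K else F_MAX

# Hypothetical stand-in for rho_t, decaying like 1/sqrt(t) and clipped to [0, 2].
rho = [min(2.0, 1.0 / math.sqrt(t)) for t in range(1, 501)]

large, small = split_violation(rho, f, K)
bound = large_part_bound(rho, F_MAX, K)
print(large <= bound)
```

Only finitely many rounds have $\rho_t > k$ once $\sum_t \rho_t^2$ is controlled, so this part contributes an additive term while the $\rho_t \leq k$ part carries the main rate.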