Causal Bandits: The Pareto Optimal Frontier of Adaptivity, a Reduction to Linear Bandits, and Limitations around Unknown Marginals
Abstract
In this work, we investigate the problem of adapting to the presence or absence of causal structure in multi-armed bandit problems. In addition to the usual reward signal, we assume the learner has access to additional variables, observed in each round after acting. When these variables d-separate the action from the reward, existing work in causal bandits demonstrates that one can achieve strictly better (minimax) rates of regret (Lu et al., 2020). Our goal is to adapt to this favorable “conditionally benign” structure, if it is present in the environment, while simultaneously recovering worst-case minimax regret, if it is not. Notably, the learner has no prior knowledge of whether the favorable structure holds. In this paper, we establish the Pareto optimal frontier of adaptive rates. We prove upper and matching lower bounds on the possible trade-offs in the performance of learning in conditionally benign and arbitrary environments, resolving an open question raised by Bilodeau et al. (2022). Furthermore, we are the first to obtain instance-dependent bounds for causal bandits, by reducing the problem to the linear bandit setting. Finally, we examine the common assumption that the marginal distributions of the post-action contexts are known and show that a nontrivial estimate is necessary for better-than-worst-case minimax rates.
1 Introduction
In real-world decision making, we often want strong worst-case guarantees as well as the ability to adapt to favorable properties of real-world scenarios. Adaptive sequential decision-making offers a framework to design algorithms to achieve these objectives.
In this paper, we explore adaptivity in multi-armed bandit problems. In standard multi-armed bandits, the learner (policy) takes an action, receives a reward, and then this process repeats over a number of rounds. The learner’s regret is the difference between its cumulative reward and the cumulative reward of the single best action in hindsight. Can we work to identify high-reward actions while minimizing regret?
In this work, we assume there is post-action context, i.e., there may be additional information available to the learner after taking an action, beyond the reward signal. In a worst-case analysis, however, the learner can ignore the post-action context and still achieve minimax rates of regret: the worst-case environment will not offer useful information. However, many real-world settings possess the structure of multi-armed bandit problems with post-action context and, in those cases, this additional information is useful towards minimizing regret.
One way that post-action context can be useful is if we can assume causal structure relating the action (i.e., an intervention) to the reward and post-action (post-intervention) context. Several authors have studied models in this vein (Bareinboim et al., 2015; Lattimore et al., 2016). In this work, we build on the framework of Lattimore et al. (2016), wherein the post-action context is assumed to d-separate each intervention from its associated reward.
Under d-separation, the intervention and reward are independent, conditional on the post-intervention context. Bilodeau et al. (2022) formalized this structure in general terms: a bandit environment is conditionally benign whenever the conditional distribution of the reward, given the post-action context, does not depend on the action.
Minimax regret is well understood for both the classical and causal variants of multi-armed bandits. Notably, algorithms tailored to conditionally benign environments can achieve lower rates of regret, scaling with the number of post-action contexts, rather than the potentially much larger set of actions (Lu et al., 2020; Bilodeau et al., 2022).
Exploiting causal structure is not without its pitfalls. Bilodeau et al. (2022) showed that C-UCB, a minimax optimal causal bandit algorithm, suffers linear regret in some non-benign environments. This raised a natural question: Can we achieve strict adaptivity, i.e., obtain minimax rates simultaneously in the class of conditionally benign environments and in the class of all environments, without knowing in advance which class of environments we will face?
Bilodeau et al. proved that strict adaptivity was impossible, but showed some level of adaptivity was possible. They designed a new algorithm, termed HAC-UCB, and proved that it simultaneously achieves minimax optimal rates on the class of benign environments and always achieves (suboptimal, though sublinear) rates. In light of this result, Bilodeau et al. raised an open problem, asking whether HAC-UCB was, in a sense, Pareto optimal, implying that the slower rate was the price of adaptivity. More generally, we ask:
What is the Pareto optimal frontier of simultaneously achievable rates of regret in the classes of benign and arbitrary environments, and what algorithms achieve these optimal tradeoffs?
In this paper, we address the above question by providing a complete characterization of the Pareto optimal frontier (up to log factors) as well as the algorithms that achieve it. Besides adaptation, we also study the complexity of causal bandit problems from other perspectives. More specifically, we find a novel reduction from causal bandits to linear bandits, which yields the first instance-dependent regret bound for causal bandits and enables the application of some linear bandit algorithms to causal bandits. We also investigate dropping the common assumption that we have perfect knowledge of “the marginals”, i.e., the distribution of the post-action context variable, under each action. On the one hand, we show that it is impossible for any algorithm to enjoy improved minimax regret in benign environments without any knowledge of the true marginals. On the other hand, we identify cases where approximate knowledge of the marginal distributions suffices. Our contributions are explained in more detail as follows.
- In Section 3, we establish near-optimal Pareto regret frontiers for the setting of causal bandits, resolving an open problem raised by Bilodeau et al. (2022), see Figure 1. Utilizing a dynamic balancing method introduced by Cutkosky et al. (2021), we derive the upper bound and also prove near-optimal matching lower bounds. Remarkably, we introduce a phenomenon we call the price of adaptivity, to capture the extra regret that one must incur when attempting to adapt to the presence or lack of causal structure. Consequently, we demonstrate that the model selection method introduced by Cutkosky et al. (2021) cannot be generally improved, for any nontrivial general improvement would decrease the price of adaptivity beyond our lower bound.
- In Section 4, we present a novel reduction from causal bandits to linear bandits with conditional sub-Gaussian noise. Utilizing a phased elimination technique (Lattimore et al., 2020), we identify a new dimension measuring the inherent complexity of causal bandits. It allows us to establish the first instance-dependent regret bound and a strictly tighter worst-case regret bound for causal bandits in conditionally benign environments. Additionally, we prove instance-dependent bounds for stochastic linear bandits, which are novel to the best of our knowledge.
- In Section 5, we study the situation where we have limited knowledge of the marginal distributions over post-action contexts. We provide a lower bound indicating that no algorithm can utilize the causal structure to achieve improved minimax rates without such prior knowledge. This partly justifies the common assumption in the causal bandits literature that algorithms are given the marginals. On the other hand, we give a regret upper bound for the phased elimination algorithm with access to approximate marginals. This result shows that partial knowledge of the marginals suffices in some regimes.
![Figure 1: the Pareto regret frontier](extracted/5699231/pareto.png)
1.1 Related Work
Causal bandits.
The causal bandit model was introduced by Lattimore et al. (2016), whose objective was to identify the best intervention. This pure exploration problem has been extensively studied since then (Sen et al., 2017; Xiong & Chen, 2022), while other works have focused on regret minimization (Lu et al., 2020; Nair et al., 2021; Bilodeau et al., 2022). Another interesting topic is relaxing the causal assumptions. For example, the assumption of a known causal graph can be relaxed (Lu et al., 2021; Malek et al., 2023). Our work mainly builds on the study by Bilodeau et al. (2022) regarding adapting to the existence of causal structure as well as approximate marginals.
Model selection.
To achieve adaptivity, a natural idea is to apply a model selection algorithm on top of a group of base learners. There is a growing line of work studying such corralling strategies in the bandit setting (Agarwal et al., 2017; Pacchiano et al., 2020a, b; Cutkosky et al., 2020; Arora et al., 2021; Cutkosky et al., 2021). Agarwal et al. (2017) required certain stability conditions on the base learners, which restricts the applicability of their algorithm. In contrast, some recently proposed general-purpose model selection algorithms for stochastic bandit problems (Pacchiano et al., 2020b; Cutkosky et al., 2020, 2021) are better candidates in our setting, since they require only mild assumptions on the base learners.
Pareto optimal frontier.
When we have multiple performance metrics but are unable to achieve the best under all of them simultaneously, the Pareto optimal frontier becomes the natural objective to pursue. Problems with several competing benchmarks are abundant in the bandit literature (Koolen, 2013; Lattimore, 2015; Marinov & Zimmert, 2021; Zhu & Nowak, 2022).
2 Problem Setup
We consider the problem of stochastic bandit with post-action contexts, as defined by Bilodeau et al. (2022) and follow their notations. Let be the finite action space, be the finite context space and be the reward space. For any set , we use to denote the set of all probability distributions supported on . For any , we use to denote its marginal distribution over , and use to denote its the conditional distribution over conditioning on the component.
In this bandit problem, a learner interacts with the stochastic environment for rounds. The role of the environment is instantiated with a family of distributions indexed by actions in . For each round , the learner picks an action from and then receives a context-reward pair which is independently sampled from .
To model learner’s strategy, we need to formalize the information that can be used for learner’s prediction. Let denote the observed history up to round , which is a random variable valued in . A policy by the learner could be modeled as a sequence of measurable maps from ’s to
where is the space of all policies compatible with . Then the learner follows this policy by selecting for each round . Indeed, the distribution of all outcomes over rounds, i.e. , is determined by the environment and the player’s policy together. We will always highlight the ambient joint distribution by the subscript on probabilistic operators and , say and . Additionally, we denote the expected reward for action and the optimal action by
(1) |
The goal of the learner is to choose some policy that maximizes her expected cumulative reward , or equivalently minimizes her expected pseudo-regret
with being the realized regret, which is stochastic.
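The regret notion above can be illustrated with a short numerical sketch. The function names and the example numbers below are hypothetical and not taken from the paper; the sketch assumes the realized regret is the sum of sub-optimality gaps of the played actions, and that the expected pseudo-regret is obtained by weighting each gap by the expected number of plays.

```python
import numpy as np

def realized_pseudo_regret(mean_rewards, actions):
    """Sum of sub-optimality gaps of the actions actually played (assumed form)."""
    mean_rewards = np.asarray(mean_rewards, dtype=float)
    gaps = mean_rewards.max() - mean_rewards
    return float(gaps[np.asarray(actions, dtype=int)].sum())

def expected_pseudo_regret(mean_rewards, expected_counts):
    """Gap-weighted expected play counts: sum_a gap(a) * E[number of plays of a]."""
    mean_rewards = np.asarray(mean_rewards, dtype=float)
    gaps = mean_rewards.max() - mean_rewards
    return float(gaps @ np.asarray(expected_counts, dtype=float))

# Hypothetical three-action instance; action 2 is optimal.
example_means = [0.2, 0.5, 0.9]
example_regret = realized_pseudo_regret(example_means, [0, 0, 2, 1])
```

Playing the optimal action contributes nothing to either quantity, so a policy has low regret exactly when it rarely plays actions with a large gap.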
Conditionally benign property and d-separation.
Under certain structures, the post-action context variable enables more efficient exploration and hence smaller regret. One special structure that can be exploited for a better regret guarantee in our setting is the conditionally benign property, introduced by Bilodeau et al. (2022).
Definition 2.1.
(Bilodeau et al., 2022, Definition 3.1) An environment is conditionally benign if and only if there exists such that for each , and p-a.s. We further denote the space of all conditionally benign environments by .
The conditionally benign property is quite general in the sense that it is equivalent to or weaker than some well-studied causal assumptions (Bilodeau et al., 2022). In particular, the conditionally benign property is the same thing as the context variable being a d-separator when the action set consists of all interventions. To leverage this benign structure, the causal UCB (C-UCB) algorithm proposed by Lu et al. (2020) achieves $\tilde{O}(\sqrt{|\mathcal{Z}|T})$ regret, where $|\mathcal{Z}|$ is the number of contexts, while non-causal algorithms that are unaware of this structure would still incur the possibly worse regret of $\tilde{O}(\sqrt{|\mathcal{A}|T})$, where $|\mathcal{A}|$ is the number of actions.
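As a concrete illustration of the conditionally benign property, the following sketch checks it on a small tabular environment. It assumes Bernoulli rewards, so that the conditional reward distribution given a context is fully described by its mean; the array layout and the function name are hypothetical and chosen for this example only.

```python
import numpy as np

def is_conditionally_benign(marginals, cond_means, tol=1e-12):
    """Check that E[Y | Z = z] agrees across all actions that reach context z.

    marginals[a, z]  : probability of context z after playing action a
    cond_means[a, z] : mean of the (Bernoulli) reward given context z under action a
    Contexts with zero probability under an action are ignored for that action,
    mirroring the almost-sure qualifier in the definition.
    """
    marginals = np.asarray(marginals, dtype=float)
    cond_means = np.asarray(cond_means, dtype=float)
    for z in range(marginals.shape[1]):
        reachable = marginals[:, z] > 0
        if reachable.any():
            values = cond_means[reachable, z]
            if values.max() - values.min() > tol:
                return False
    return True

# Action 1 never produces context 1, so its entry there is irrelevant.
benign = is_conditionally_benign([[0.5, 0.5], [1.0, 0.0]],
                                 [[0.2, 0.8], [0.2, 0.3]])
# Here the reward given context 0 depends on the action, so the check fails.
not_benign = is_conditionally_benign([[0.5, 0.5], [1.0, 0.0]],
                                     [[0.2, 0.8], [0.9, 0.3]])
```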
2.1 Adaptivity
A natural question is whether we can compete with C-UCB when the environment is conditionally benign while still maintaining the worst-case regret guarantee, without prior knowledge of the nature of the environment. Unfortunately, algorithms designed specifically for the benign setting may fail drastically in non-benign settings. For instance, C-UCB provably incurs linear regret in some non-benign environments (Bilodeau et al., 2022). To remedy this, Bilodeau et al. (2022) devised HAC-UCB by adding a hypothesis test in each round, which is used to switch irreversibly from C-UCB to UCB whenever it detects a deviation from the conditionally benign property. HAC-UCB is able to recover the regret of C-UCB in benign settings and achieve sublinear regret in the worst case.
Prior to this work, it was not known whether HAC-UCB is optimal. Indeed, Bilodeau et al. (2022) showed that strict adaptation, meaning always achieving the optimal worst-case regret while still performing as well as C-UCB when causal structure exists, is impossible. But this does not rule out the possibility of improving the worst-case regret of HAC-UCB unilaterally. In this paper, we show that such an improvement is indeed feasible and thus obtain an algorithm that dominates HAC-UCB. Further, we show that our regret guarantee cannot be improved, in the sense of Pareto optimality.
Remark 2.2.
Regarding optimal rate of regret under the presence of causal structure, it is easy to show a regret lower bound, nearly matching existing regret upper bounds. Whether the log-factors can be shaved from the upper bound is unknown. However, the lower bound of Bilodeau et al. (2022) still implies that strict adaptation is impossible for general and , since when is, say, , the lower bound in benign settings rules out a upper bound.
Generic algorithms.
For rigorous treatment of adaptivity, we adopt the definition of algorithms as maps from Bilodeau et al. (2022). Specifically, an algorithm is any map from problem-specific inputs to the space of compatible policies
where is the marginal distribution accessed by this algorithm as prior knowledge. When talking about algorithm-induced policies, by default we mean if not stated otherwise, following the common assumption in the literature of causal bandits. We will also deal with the case of imperfect prior knowledge in Section 5, where may not be the exact . For notation simplicity, we will use to denote its induced policy when the problem-specific inputs are clear from context. For example, is the same thing as .
3 The Pareto Regret Frontier
To formalize our notion of Pareto regret frontier, we need the following definition:
Definition 3.1.
A pair of rate functions is said to be realizable if there is an algorithm such that for all and ,
A pair is reasonable if and .
In the following we elide the dependence of rates on and below for clarity. We can now describe the Pareto regret frontier, i.e., the set of optimal realizable pairs of rates.
Theorem 3.2.
There exist universal constants such that
1. Upper bound: If is reasonable and , then is realizable;
2. Lower bound: For all realizable , we have or .
Both upper and lower bounds will be extensively discussed in the following sections.
3.1 Upper Bounds
In this section, we show that our upper bound can be obtained by applying the algorithmic principle of dynamic balancing (Cutkosky et al., 2021) to the stochastic bandit problem with post-action contexts. This method is motivated by the fact that, under mild assumptions, it always achieves $\tilde{O}(\sqrt{T})$ regret when run on top of a collection of base learners with $\tilde{O}(\sqrt{T})$ regret. So the dependence on $T$ in the regret of HAC-UCB in Bilodeau et al. (2022) is easily improved. The use of dynamic balancing in our bandit setting can be justified by the fact that dynamic balancing does not rely on what kind of (stochastic) contextual information can be observed in the underlying bandit problem. See Appendix A for a detailed explanation.
Algorithm 1: Dynamic balancing
Input: Two base learners; for each base learner, the factor of its candidate regret bound, a reward bias and a scaling coefficient (hyper-parameters); and a confidence level.
1. Initialize the statistics of every base learner and the set of active learners.
2. For each round, do:
(a) Select a learner from the active set.
(b) Play the action of the selected learner and receive the reward and context.
(c) Update the selected learner with the observed reward and context.
(d) Update the statistics of the selected learner.
(e) Compute the adjusted average reward and confidence band for all learners.
(f) Update the set of active learners.
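The steps of Algorithm 1 can be sketched in code as follows. This is a simplified illustration in the spirit of dynamic balancing, not the exact procedure analyzed here: the selection rule (smallest candidate bound `d[i] * sqrt(n[i])`), the form of the confidence band, the constant `c`, and the `UCB` stand-in base learner are all assumptions, and the reward bias and scaling coefficients are omitted.

```python
import math
import random

class UCB:
    """Textbook UCB1 over a fixed list of arms (stand-in base learner)."""
    def __init__(self, arms):
        self.arms = list(arms)
        self.n = {a: 0 for a in self.arms}
        self.s = {a: 0.0 for a in self.arms}
        self.t = 0

    def act(self):
        for a in self.arms:  # play every arm once before using indices
            if self.n[a] == 0:
                return a
        return max(self.arms, key=lambda a: self.s[a] / self.n[a]
                   + math.sqrt(2 * math.log(self.t + 1) / self.n[a]))

    def update(self, a, r):
        self.t += 1
        self.n[a] += 1
        self.s[a] += r

def dynamic_balancing(learners, d, pull, T, delta=0.05, c=1.0):
    """Balance the candidate regret bounds d[i] * sqrt(n[i]) of the base learners."""
    M = len(learners)
    n, U = [0] * M, [0.0] * M          # play counts and cumulative rewards
    active = set(range(M))
    plays = []
    for _ in range(T):
        # (a) pick the active learner with the smallest candidate regret so far
        i = min(active, key=lambda j: d[j] * math.sqrt(n[j]))
        # (b)-(d) play its action, feed back the reward, update statistics
        a = learners[i].act()
        r = pull(a)
        learners[i].update(a, r)
        n[i] += 1
        U[i] += r
        plays.append(a)
        # (e) confidence band around each learner's average reward (assumed form)
        band = lambda j: c * math.sqrt(math.log(M * (1 + n[j]) / delta) / n[j])
        played = [j for j in range(M) if n[j] > 0]
        best_lower = max(U[j] / n[j] - band(j) for j in played)
        # (f) keep learners whose candidate bound is not contradicted by the data;
        # the set is recomputed every round, so no learner is removed permanently
        active = {j for j in range(M) if n[j] == 0
                  or U[j] / n[j] + d[j] / math.sqrt(n[j]) + band(j) >= best_lower}
    return plays, n
```

A hypothetical usage would pair an optimistic learner restricted to a few arms with a robust learner over all arms, for example `dynamic_balancing([UCB([0, 1]), UCB([0, 1, 2])], d=[1.0, 3.0], pull=env, T=500)` for some reward function `env`. The learner attaining the best lower confidence bound always passes the test in step (f), so the active set is never empty.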
Note that the dynamic balancing algorithm (Algorithm 1) takes as input a set of user-specified candidate regret bounds, one for each base learner (of a particular form in our setting). In each round, dynamic balancing merely picks the base learner with the minimal candidate regret bound, and performs a test to identify and deactivate the learners that seem to violate their candidate regret bounds. As long as there is one base learner whose candidate regret bound is valid, dynamic balancing is able to compete with the best of such base learners. A more comprehensive exposition of the idea behind dynamic balancing can be found in Cutkosky et al. (2021).
So naturally, we need one base learner that is favorable in benign instances and another base learner that remains robust in non-benign instances. For example, we can pick C-UCB and UCB, but note that any other algorithms with similar regret bounds can be applied as well. Formally, we characterize base learners that enjoy a certain regret bound in a certain type of environment by the following definition:
Definition 3.3.
Let . A family of learners is a -benign family if, for all , for all benign instances, with probability at least , for all , has regret no larger than . Similarly, a learner is a -arbitrary learner if, for all , for all instances, with probability at least , for all , has regret no larger than .
Let and be the families of instances of the and algorithms, respectively, whose confidence band is scaled by . See Appendix C for details.
Proposition 3.4.
is a -benign family for and is a -arbitrary family for .
Note that the above result for UCB is folklore, but the result for C-UCB is new. The following result describes the adaptive regret of dynamic balancing acting on a benign family and an arbitrary family, which validates the upper bound in Theorem 3.2. What is more impressive is that, to realize every point on the Pareto regret frontier (up to log factors), we need only tune the hyper-parameters of dynamic balancing accordingly. We elide the dependence of rates on and below for clarity. See Section A.2 for the proof.
Theorem 3.5.
Let be a -benign family and let be a -arbitrary family of learners, where , . For every pair of reasonable rate functions such that , there exist hyper-parameters , , such that, for all instances , the policy , for , given by Algorithm 1 with and , satisfies
Corollary 3.6.
Taking and , the conclusion of Theorem 3.5 is
Corollary 3.6 indicates that we need to pay an extra multiplicative factor in the worst-case regret for adaptivity, and it already improves over the bound for HAC-UCB in terms of worst-case regret. Moreover, our regret analysis does not require the cumbersome assumption made in their analysis. This improvement may be explained as follows. Both dynamic balancing and HAC-UCB play with two base learners and decide which to pick in each round. However, dynamic balancing operates in a more reasonable way: it alternates between the two base learners and never deactivates either of them permanently, whereas HAC-UCB first plays the optimistic base learner persistently up to some point and then switches to UCB for the remaining rounds. Thus the regret that HAC-UCB incurs by running the optimistic base learner improperly may be dominant.
3.2 Lower Bounds
In this section we elaborate on the lower bound in Theorem 3.2 in the following Theorem 3.7, which is a generalization of (Bilodeau et al., 2022, Theorem 6.2). The proof of Theorem 3.7 closely follows that of the original, but we are able to derive a continuum of lower bounds that constitute the Pareto regret frontier. For completeness, we provide the full proof in Section D.1.
Theorem 3.7.
There exist constants such that, for all MAB algorithms , rate functions , if, for all
then, for all and , there exists a conditionally benign environment such that either or there exists a conditionally benign environment such that
Theorem 3.7 shows that any pair of realizable rates must have their product lower bounded by unless the worst-case regret bound is vacuously large. Combining Theorem 3.5 with Theorem 3.7, we have justified the Pareto optimality of dynamic balancing. As a corollary, we have found a problem of adaptation where model selection method can be optimal and the price of adaptivity is witnessed by the additional multiplicative factor of in the regret bound.
4 Instance-Dependent Bounds via Phased Elimination Algorithm
Besides achieving the Pareto optimal regret bounds in Theorem 3.5, which are worst-case in nature, the dynamic balancing algorithm can also enjoy instance-dependent regret at the same time under additional assumptions on the base learners. In particular, C-UCB may not be our best choice for the benign base learner. To leverage the strength of dynamic balancing, we propose in this section a new causal bandit algorithm that enjoys worst-case regret and a novel logarithmic instance-dependent regret in benign settings. We are the first to pursue instance-dependent results in conditionally benign environments for algorithms that are minimax optimal (up to log factors).
Our new algorithm is built upon the idea of phased elimination with G-optimal design from linear bandits (Lattimore & Szepesvári, 2020; Lattimore et al., 2020). Our regret analysis hinges on a novel reduction from causal bandits to linear bandits. This reduction enables the use of a broad family of linear bandit algorithms in conditionally benign environments, whose regret guarantees remain intact.
Finally, we will discuss the possibilities and challenges regarding adaptive instance-dependent regret.
4.1 Reduction to Linear Bandits
We need additional notations to illustrate our causal-to-linear reduction. For benign instance , define the mean reward vector by . Also, in this section we use to denote its associated marginal distribution vector , and we won’t distinguish between an action and its associated marginal vector .
Recall that in each round we play some action and then observe context and reward . By simply ignoring the realized contexts , we can write , where is conditionally 1-sub-Gaussian since and . So now we may think of the game to be linear bandit with actions being and the unknown mean reward vector being . Therefore, any linear bandit algorithm that allows such conditionally sub-Gaussian noise condition should be able to operate in our benign setting by ignoring the realized contexts. More importantly, its regret analysis will go through without change, and hence its regret bounds are retained without loss.
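The reduction can be checked numerically with a small sketch. It assumes Bernoulli rewards whose mean depends only on the realized context, which is one simple instance of a conditionally benign environment; the names `expected_rewards` and `sample_round` and the example numbers are hypothetical. The expected reward of each action is then the inner product between its marginal vector and the mean reward vector, and the residual is zero-mean noise bounded in [-1, 1].

```python
import numpy as np

def expected_rewards(marginals, theta):
    """Mean reward of every action: inner product of marginal vector and theta."""
    return np.asarray(marginals, dtype=float) @ np.asarray(theta, dtype=float)

def sample_round(rng, marginals, theta, a):
    """Draw a context from the marginal of action a, then a Bernoulli reward."""
    marginals = np.asarray(marginals, dtype=float)
    z = int(rng.choice(len(theta), p=marginals[a]))
    y = float(rng.random() < theta[z])
    return z, y

# Two actions, two contexts; theta[z] is the mean reward given context z.
M = np.array([[0.5, 0.5], [0.9, 0.1]])
theta = np.array([0.2, 0.8])
mu = expected_rewards(M, theta)

# A linear bandit learner only needs (action vector, reward) pairs,
# so the realized context z can simply be dropped.
rng = np.random.default_rng(0)
rewards = [sample_round(rng, M, theta, 1)[1] for _ in range(20000)]
noise = np.array(rewards) - mu[1]
```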
4.2 Phased Elimination and its Regret Bound
Among all valid linear bandit algorithms that can be applied in conditionally benign environments, we opt for the phased elimination algorithm (PE) due to its superior performance whenever the action set is finite. Its pseudo-code is summarized in Algorithm 2, which is essentially the same as that of Lattimore et al. (2020). However, the regret guarantees we present for PE are novel. Our first result is an anytime worst-case regret bound, which qualifies PE as a base learner for dynamic balancing. As before, we consider the family of instances of the phased elimination algorithm, indexed by the confidence level.
Theorem 4.1 (Worst-case regret bound for PE).
For all , the policy given by Algorithm 2 satisfies the following regret bound for all conditionally benign environments ,
with probability at least , where and is a universal constant. Note that and it could be in the worst-case. In particular, after taking , we obtain the expected regret bound
Corollary 4.2.
is a -benign family for . Therefore, the expected regret bound in Theorem 3.5 can also be achieved by with and as base learners.
See Appendix B for the proof. Thanks to our reduction, the bound in Theorem 4.1 depends (up to log factors) only on the dimension defined there rather than on the number of contexts. This indicates that the intrinsic complexity of the causal bandit problem can be smaller than what is captured by the regret bound of C-UCB.
Next we give an instance-dependent regret bound for PE. Note that this bound is new even for stochastic linear bandits (with finite action sets). See Appendix B for the proof.
Theorem 4.3 (Instance-dependent regret bound for PE).
For all , the policy given by Algorithm 2 satisfies the following regret for all conditionally benign environments ,
with probability at least , where is the minimal sub-optimality gap of instance and is a universal constant. In particular, taking ,
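Algorithm 2: Phased elimination (PE)
Input: Action set, marginals, and confidence level.
1. Initialize the phase counter and the initial active set.
2. Find a near-optimal (G-optimal) design over the active set.
3. Compute the number of times each active action is to be played in the current phase.
4. Play each active action exactly that many times; these rounds constitute the current phase. Observe the corresponding context-reward pairs.
5. Compute the empirical (least-squares) estimate of the mean reward vector.
6. Eliminate low-rewarding actions and update the active set.
7. Increment the phase counter and go to step 2.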
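The steps of Algorithm 2 can be sketched as follows. This is an illustrative implementation under stated assumptions, not the exact algorithm analyzed here: the design is computed with a Frank-Wolfe iteration, and the phase lengths and the elimination threshold follow the textbook form of phased elimination for linear bandits rather than the constants of this paper. The rows of `X` play the role of the marginal vectors of the actions, and `pull` is a hypothetical function returning the reward of an action.

```python
import math
import numpy as np

def g_optimal_design(X, iters=200):
    """Frank-Wolfe iteration for an approximate G-optimal design over the rows of X."""
    k, d = X.shape
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        V = X.T @ (pi[:, None] * X)
        # g[i] = x_i^T V^+ x_i; the pseudo-inverse handles rank-deficient active sets
        g = np.einsum("ij,jk,ik->i", X, np.linalg.pinv(V), X)
        j = int(np.argmax(g))
        if g[j] <= 1.01 * d:
            break
        step = (g[j] / d - 1.0) / (g[j] - 1.0)
        pi = (1.0 - step) * pi
        pi[j] += step
    return pi

def phased_elimination(X, pull, T, delta=0.05):
    """Run phased elimination for T rounds; return surviving actions and last estimate."""
    k, d = X.shape
    active = np.arange(k)
    t, ell = 0, 1
    while t < T:
        Xa = X[active]
        pi = g_optimal_design(Xa)
        eps = 2.0 ** (-ell)
        # textbook phase length (an assumption, not the paper's exact constants)
        scale = 2 * d / eps ** 2 * math.log(k * ell * (ell + 1) / delta)
        counts = np.ceil(pi * scale).astype(int)
        V, b = np.zeros((d, d)), np.zeros(d)
        for x, a, m in zip(Xa, active, counts):
            for _ in range(m):
                if t >= T:
                    break
                b += x * pull(int(a))
                V += np.outer(x, x)
                t += 1
        # least-squares estimate from this phase only; contexts are ignored
        theta_hat = np.linalg.pinv(V) @ b
        est = Xa @ theta_hat
        active = active[est >= est.max() - 2 * eps]
        ell += 1
    return active, theta_hat
```

Using the pseudo-inverse is one simple way to cope with an active set that no longer spans the whole space, which is the situation discussed in Remark 4.4; it is a substitute for, not a reproduction of, the change of basis described there.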
Remark 4.4.
During the implementation of Algorithm 2, it is possible that cannot span for some such that is singular for any . For example, in later phases can be smaller than . Let’s say . One workaround is to apply some invertible matrix to every such that can be decomposed to a dim- vector and a tail of zeros, and can span . Now we use as our active set in phase and the analysis would go through.
4.3 Roadblocks: Instance-Dependent Bounds
Unlike adaptive worst-case regret studied in Section 3, adaptive instance-dependent regret is less understood and a general theory is still absent in the literature. In particular, we do not know if regret can always be achieved, and whenever achieved, whether it is tight. These issues are illustrated for model selection methods in the following. First, it is easy to see that regret can always be achieved in benign environments, e.g., by corralling and using dynamic balancing, because in this case both base learners admit logarithmic regret. However, the regret bound of is dominant and thus naive calculation only leads to a regret for . It remains open whether we can adapt to the smaller regret achieved by PE in benign environments. Second, regret is not always granted by model selection in non-benign instances. The only exception we are aware of in the literature is the case where the causal base learner is assumed to incur linear regret whenever its candidate regret bound fails (Cutkosky et al., 2021, Theorem 31). If this type of “algorithm gap” holds, will only choose the causal base learner on a number of rounds, and hence enjoy logarithmic regret. Moreover, without changing the parameter setting, is able to realize the Pareto optimal rates up to log factors. However, the “algorithmic gap” requirement on the causal base learner is so stringent that we do not know if it is met by any algorithm in every instance. In Appendix E, we show that a version of incurs linear regret on some instances.
5 Limited Knowledge of the Marginal Distributions over Context Variables
So far, we have assumed that algorithms know the marginal distribution over the post-action context for each arm. Of course, perfect knowledge of these marginals may not hold in practice. What is the effect of only having access to approximate marginals on achievable rates of regret?
In this section, we study this question. We give a lower bound indicating that, with zero access to the marginals, it is impossible for any algorithm to exploit the causal structure and beat the minimax rate of an arbitrary environment. To model this setting, recall that algorithms are defined as mappings from problem-specific inputs, including the marginals, to policies. So naturally, algorithms considered agnostic to the marginals should be constant in the marginals, leading to the following definition:
Definition 5.1.
An algorithm is said to be agnostic to marginals if, for any and , the map
is constant over . We denote the set of all such algorithms by .
Examples of such algorithms include not only heuristic non-causal algorithms like UCB, but also versions of causal algorithms that are always given the same marginals. For every such algorithm, we write the policy it induces in a way that highlights its independence of the marginal component. Our lower bound shows that, under this zero-marginal-knowledge regime, we cannot do better than the optimal non-causal algorithm.
Theorem 5.2.
For all and MAB algorithms , there exists a conditionally benign environment such that
where is a universal constant.
See Section D.2 for the proof.
Remark 5.3.
Our lower bound improves on Lu et al. (2020, Theorem 4), which is of the form of and only holds for some set of non-causal algorithms, which is a strict subset of .
5.1 Phased Elimination with Approximate Marginals
Despite the negative result of Theorem 5.2, we now argue that some level of misspecification is allowed in the prior knowledge of the marginals. When interacting with an environment, suppose we are given approximate marginals that may deviate from the true marginals to some extent. We now show that even instantiating PE with the possibly inaccurate marginals can still yield a nontrivial regret guarantee, following a similar result for C-UCB by Bilodeau et al. (2022). First we need the following definition to measure the deviation of the approximate marginals from the true ones.
Definition 5.4.
(Bilodeau et al., 2022, Definition 4.2) For any , and are said to be -close if
Due to our reduction in Section 4.1, causal bandits with misspecified marginals reduce to the well-studied misspecified linear bandits, which yields the following regret bound that subsumes Theorem 4.1. The proof is largely based on the analysis of phased elimination in Lattimore et al. (2020, Proposition 5.1), with necessary modifications for handling conditionally sub-Gaussian noise and providing an anytime regret bound. See Appendix B for details.
Theorem 5.5 (Worst-case regret bound, with approximate marginal distributions).
In any conditionally benign environment, suppose we instantiate PE with approximate marginals. If the approximate and true marginals are close in the sense of Definition 5.4, then with probability at least , the regret of PE is bounded for all rounds by
where is a universal constant and
It is implied that suffices to recover all aforementioned regret guarantees of phased elimination and dynamic balancing. On the other hand, such numerical requirement on is almost necessary for us to avoid the lower bound in Theorem 5.2: from the proof of Theorem 5.2 we will find that when , for any algorithm there exists a conditionally benign environment and approximate marginal such that and are -close, but this algorithm would incur regret on when it is input by .
It is worth mentioning that the factor in the misspecification term cannot be improved in many regimes for linear bandit algorithms (Lattimore et al., 2020). However, C-UCB is able to shave this factor off (Bilodeau et al., 2022, Theorem 4.3) by utilizing realized contexts rather than the least-squares estimate of the mean reward vector. From this perspective, we see that there is a price for pursuing better instance-dependent results by ignoring the context information.
6 Conclusions and Discussions
We provide a comprehensive characterization of the Pareto regret frontier for the bandit problem in the context of adapting to causal structure whenever feasible. We also give the first instance-dependent regret bound under conditionally benign environments, based on our novel causal-to-linear reduction. Finally, we show that the common assumption that we have access to the true marginals is necessary in general but still can be relaxed in some cases.
For future work, it would be valuable to design algorithms that are easier to implement than running dynamic balancing over some base learners. On the theoretical side, it would be interesting to investigate other causal bandit scenarios involving adaptivity in light of our Pareto regret frontier. For example, we may define a series of “semi-benign” settings interpolating between conditionally benign environments and non-benign environments and study the Pareto regret frontier thereof.
Acknowledgements
ZL is supported by the Vector Research Grant at the Vector Institute. IA is supported by the Vatat Scholarship from the Israeli Council for Higher Education. DMR is supported by an NSERC Discovery Grant and funding through his Canada CIFAR AI Chair at the Vector Institute. The authors would like to thank Tomer Koren, Blair Bilodeau and Csaba Szepesvári for helpful discussions at different stages of this work.
References
- Agarwal et al. (2017) Agarwal, A., Luo, H., Neyshabur, B., and Schapire, R. E. Corralling a band of bandit algorithms. In Conference on Learning Theory. PMLR, 2017.
- Arora et al. (2021) Arora, R., Marinov, T. V., and Mohri, M. Corralling stochastic bandit algorithms. In International Conference on Artificial Intelligence and Statistics. PMLR, 2021.
- Bareinboim et al. (2015) Bareinboim, E., Forney, A., and Pearl, J. Bandits with unobserved confounders: A causal approach. Advances in Neural Information Processing Systems, 28, 2015.
- Bilodeau et al. (2022) Bilodeau, B., Wang, L., and Roy, D. Adaptively exploiting d-separators with causal bandits. Advances in Neural Information Processing Systems, 35, 2022.
- Cutkosky et al. (2020) Cutkosky, A., Das, A., and Purohit, M. Upper confidence bounds for combining stochastic bandits. arXiv preprint arXiv:2012.13115, 2020.
- Cutkosky et al. (2021) Cutkosky, A., Dann, C., Das, A., Gentile, C., Pacchiano, A., and Purohit, M. Dynamic balancing for model selection in bandits and RL. In International Conference on Machine Learning. PMLR, 2021.
- Koolen (2013) Koolen, W. M. The Pareto regret frontier. Advances in Neural Information Processing Systems, 26, 2013.
- Lattimore et al. (2016) Lattimore, F., Lattimore, T., and Reid, M. D. Causal bandits: Learning good interventions via causal inference. Advances in Neural Information Processing Systems, 29, 2016.
- Lattimore (2015) Lattimore, T. The Pareto regret frontier for bandits. Advances in Neural Information Processing Systems, 28, 2015.
- Lattimore & Szepesvári (2020) Lattimore, T. and Szepesvári, C. Bandit algorithms. Cambridge University Press, 2020.
- Lattimore et al. (2020) Lattimore, T., Szepesvári, C., and Weisz, G. Learning with good feature representations in bandits and in RL with a generative model. In International Conference on Machine Learning. PMLR, 2020.
- Lu et al. (2020) Lu, Y., Meisami, A., Tewari, A., and Yan, W. Regret analysis of bandit problems with causal background knowledge. In Conference on Uncertainty in Artificial Intelligence. PMLR, 2020.
- Lu et al. (2021) Lu, Y., Meisami, A., and Tewari, A. Causal bandits with unknown graph structure. Advances in Neural Information Processing Systems, 34, 2021.
- Malek et al. (2023) Malek, A., Aglietti, V., and Chiappa, S. Additive causal bandits with unknown graph. arXiv preprint arXiv:2306.07858, 2023.
- Marinov & Zimmert (2021) Marinov, T. V. and Zimmert, J. The Pareto frontier of model selection for general contextual bandits. Advances in Neural Information Processing Systems, 34, 2021.
- Nair et al. (2021) Nair, V., Patil, V., and Sinha, G. Budgeted and non-budgeted causal bandits. In International Conference on Artificial Intelligence and Statistics. PMLR, 2021.
- Pacchiano et al. (2020a) Pacchiano, A., Dann, C., Gentile, C., and Bartlett, P. Regret bound balancing and elimination for model selection in bandits and RL. arXiv preprint arXiv:2012.13045, 2020a.
- Pacchiano et al. (2020b) Pacchiano, A., Phan, M., Abbasi Yadkori, Y., Rao, A., Zimmert, J., Lattimore, T., and Szepesvári, C. Model selection in contextual stochastic bandit problems. Advances in Neural Information Processing Systems, 33, 2020b.
- Sen et al. (2017) Sen, R., Shanmugam, K., Dimakis, A. G., and Shakkottai, S. Identifying best interventions through online importance sampling. In International Conference on Machine Learning. PMLR, 2017.
- Xiong & Chen (2022) Xiong, N. and Chen, W. Pure exploration of causal bandits. arXiv preprint arXiv:2206.07883, 2022.
- Zhu & Nowak (2022) Zhu, Y. and Nowak, R. Pareto optimal model selection in linear bandits. In International Conference on Artificial Intelligence and Statistics. PMLR, 2022.
Appendix A Regret Analysis for Dynamic Balancing
In this section we show that the regret guarantees of dynamic balancing in Cutkosky et al. (2021) can be generalized to our problem and provide a proof of our main upper bound Theorem 3.5.
Notations.
For base learner , we use to denote its candidate anytime regret bound that is expected to hold in its favorable settings. Throughout we consider with the form of , where implicitly depends on the confidence parameter . Let be the index of the base learner selected in round . is the observed cumulative reward in the first rounds where is picked, and is the number of rounds is picked by the end of round . The local regret of up to round is . We say learner is well-specified if and otherwise it is misspecified. We use to denote any well-specified learner.
A.1 Preliminaries
Roughly speaking, in each round , dynamic balancing works by (1) running a misspecification test to temporarily de-activate misspecified base learners and (2) picking the learner with minimal putative regret among all active learners in this round. In this way, the regret incurred by is comparable to that of the best well-specified learner.
Notice that dynamic balancing was originally proposed for stochastic contextual bandits (where contexts are revealed prior to actions) in Cutkosky et al. (2021). To see that can also be applied in stochastic bandits with post-action contexts, it is worth identifying several important features of :
1. First of all, the meta decision by on each round only depends on the global information, i.e. and (as well as user-specified and ). In particular, it does not need any information regarding context variables or internal states of base learners.
2. Second, only updates the selected base learner in each round , and the update only uses the reward and contextual information observed in this round, where the context can be either pre-action or post-action, or both. Thus the regret guarantees of would hold regardless of the nature of contexts given that the internal updates of base learners are not affected.
Therefore, by the above observations, dynamic balancing does not rely on what kind of (stochastic) contextual information can be observed in the underlying (stochastic) bandit problem.
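The two steps of the meta-algorithm described above (a misspecification test followed by selecting the active learner with minimal putative regret) can be sketched as follows. This is a simplified illustration under our own assumptions and not the exact procedure of Cutkosky et al. (2021): the confidence width, the form of the test, and the stub base learners (each one always yields a Bernoulli reward with a fixed mean) are hypothetical placeholders. Note that the meta-level only touches the counts and cumulative rewards of the base learners, never contexts or internal states.

```python
import math
import random

def dynamic_balancing(bounds, means, horizon, delta=0.05, seed=0):
    """Simplified meta-loop; returns how often each base learner was selected.

    bounds[i](k): putative regret bound of learner i after it was selected k times.
    means[i]: mean reward obtained when learner i is selected (stub base learner).
    """
    rng = random.Random(seed)
    m = len(bounds)
    n = [0] * m    # number of rounds each base learner was selected
    u = [0.0] * m  # cumulative reward collected by each base learner

    def width(k):
        # Placeholder Hoeffding-style confidence width (an assumption).
        return math.sqrt(math.log(m * horizon / delta) / k)

    for _ in range(horizon):
        played = [j for j in range(m) if n[j] > 0]
        best_lower = max((u[j] / n[j] - width(n[j]) for j in played),
                         default=-math.inf)
        # Step (1): a learner stays active if its average reward plus its
        # putative average regret is still consistent with the best lower bound.
        active = [i for i in range(m)
                  if n[i] == 0
                  or u[i] / n[i] + bounds[i](n[i]) / n[i] + width(n[i]) >= best_lower]
        # Step (2): select the active learner with minimal putative regret.
        i = min(active, key=lambda k: bounds[k](n[k]))
        reward = 1.0 if rng.random() < means[i] else 0.0
        n[i] += 1
        u[i] += reward
    return n

# Learner 0 is well-specified (it collects the best mean reward); learner 1
# claims the same sqrt-type bound but actually suffers linear regret.
counts = dynamic_balancing(
    bounds=[lambda k: math.sqrt(k), lambda k: math.sqrt(k)],
    means=[0.9, 0.1],
    horizon=5000,
)
print(counts)
```

In this toy run the two learners are selected alternately until the test de-activates the misspecified one, after which the well-specified learner is selected in all remaining rounds.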
Now we state the worst-case regret bound of in Cutkosky et al. (2021) adapted to our setting. First define the good event
on which we are able to control the regret of . According to the analysis of Cutkosky et al. (2021, Lemma 5), we can fix to be some absolute constant (which can be actually set to in our setting) such that for any and . Conditioning on , we have the following regret bound:
Proposition A.1 (Adapted version of Theorem 22 in Cutkosky et al. (2021)).
Let be a -benign family and let be a -arbitrary family of learners. Let be arbitrary positive real numbers. For all , we can set hyper-parameters
in dynamic balancing such that, the policy given by dynamic balancing with satisfies the following: for all instances , conditioning on and the existence of a well-specified base learner , the regret of is bounded by
where is a universal constant.
It is straightforward to see that Proposition A.1 is obtained by taking , and in Cutkosky et al. (2021, Theorem 22).
A.2 Proof of Theorem 3.5
Theorem 3.5 is the immediate consequence of the following regret bound, which is derived by instantiating in Proposition A.1 with specific values.
Proposition A.2.
For every pair of reasonable rate functions such that , we can instantiate Proposition A.1 with such that for all , the policy with the same setup as Proposition A.1 satisfies the following: for all instances , with probability at least , the regret of is bounded by
where is a universal constant.
Now we can see that our main upper bound Theorem 3.5 is proved immediately after taking , and .
Proof of Proposition A.2.
By Definition 3.3, we know that for all conditionally benign instances , with probability at least , learner is well-specified with and the regret bound in Proposition A.1 holds with . Plugging in , the regret of is bounded by
Similarly for all instances , with probability at least , learner is well-specified with and the regret bound in Proposition A.1 holds with , which is
By our assumption that is reasonable and , we have that and . Hence the regret of for all instances is further bounded by
which completes the proof. ∎
Appendix B Regret Analysis of Phased Elimination
In this section we will prove Theorem 4.3 and Theorem 5.5, while Theorem 4.1 is implied by taking in Theorem 5.5. Recall that the proof of Theorem 5.5 is based on the analysis of phased elimination in Lattimore et al. (2020, Proposition 5.1). For simplicity we will use and to denote the probabilistic operators determined jointly by the underlying conditionally benign environment and the phased elimination algorithm. We also use to denote the true sub-optimality gap and the minimal sub-optimality gap, respectively, with respect to the underlying instance .
B.1 Prerequisite
Lemma B.1.
(In-phase concentration) For any phase , let
and be the σ-algebra generated by the history up to the start of phase . Then .
Proof of Lemma B.1.
Let be the error term due to the use of inaccurate marginals, then we know that since and are close. Observe that
Using the Cauchy–Schwarz inequality and the fact that for all , the second term on the RHS of the above equality can be bounded by
To bound the first term, notice that are fixed given the history prior to the start of phase . Hence are independent conditioned on and bounded by . By standard concentration bounds, we have that with probability at least ,
where the RHS can be rewritten as
Combining the two upper bounds above and taking a union bound over all , we have that with probability at least ,
which finishes the proof. ∎
Since the marginal distributions may be inaccurate, we may not be able to show that the optimal action is never eliminated with high probability. What we can hope for instead is that actions that are near-optimal relative to the best action in are retained at the end of phase . Concretely, define to be the true optimal action within . Then we can show that is small for any that is not eliminated at the end of phase .
Lemma B.2.
Conditioning on event , any action not eliminated at the end of phase has relative sub-optimality gap .
Proof of Lemma B.2.
According to the rule of updating active set, whenever is not eliminated at the end of phase , it holds
It implies that
where we use the fact that we are conditioning on in the inequality. Hence under the true marginals ,
∎
Now we need to track , the sub-optimality of the best active action in each phase. Observe that since . Then it suffices to control each , to control the growth of .
Lemma B.3.
Conditioning on event , we have .
Proof of Lemma B.3.
Suppose happens. Notice that the result holds trivially if is not eliminated at the end of phase , because in this case . On the other hand, if is eliminated, define to be the empirically best action at the end of phase and then we have
according to the test performed. In the meantime, recall that due to in-phase concentration and closeness between and ,
Hence we get
and
∎
Corollary B.4.
For any and conditioning on , we have that and for all .
Proof of Corollary B.4.
By conditioning on the intersection of all , we have that
which implies that . In particular, we have
Since every action passes the test at the end of the th phase and hence is not eliminated, by Lemma B.2 we know
Therefore, for all ,
∎
B.2 Proof of Theorem 5.5
Now we are prepared to prove Theorem 5.5.
Proof of Theorem 5.5.
Let be the index of the phase where round is located. It’s easy to see that . In the following we condition on the event , which happens with probability at least due to Lemma B.1.
Notice that phase is not necessarily completed at the end of round , but we can always upper bound the regret by the regret incurred in the first complete phases. That is,
Since we have controlled sub-optimality of all active actions in Corollary B.4, it holds with probability at least that
where is an absolute constant that can vary from line to line. Thus we have finished the proof. ∎
B.3 Proof of Theorem 4.3
Now we go back to the setting where . The only modification needed to work out Theorem 4.3 is an instance-dependent control over the number of phases for which sub-optimal arms are not entirely eliminated.
Proof of Theorem 4.3.
Again suppose happens for all . From Corollary B.4 we know that every suboptimal action can only be played in those phases such that , in addition to the first phase. Let
be the maximal number of phases where can be played. It is easy to see that
Hence there are at most phases before all suboptimal actions are eliminated, and can be controlled more carefully:
where is an absolute constant that can vary from line to line. Again the above regret bound holds with probability at least so we are done. ∎
Appendix C Anytime Regret Bounds for UCB and C-UCB
In this section we verify Proposition 3.4 for and algorithms for completeness. Note that our anytime regret bound for is new in the literature.
C.1 Preliminaries
For each and , define to be the number of times action is chosen in the first rounds, and define to be the number of times context is observed in the first rounds. Further define the mean reward estimates by
Then we introduce the upper confidence bounds used by the UCB-type algorithms under consideration. Given any prescribed confidence parameter , define and for each and . Furthermore, we use and to denote the standard UCB algorithm and C-UCB algorithm (Lu et al. 2020) which run by playing actions and at each round respectively, according to:
Before analyzing the regret of and , let’s finally define some high-probability events on which we can control the regret. For any given confidence parameters , define
and in conditionally benign environments we additionally define
where we recall is well-defined here. First we can see that and happen with probability at least regardless of the underlying environment and chosen policy:
Lemma C.1 (Lemma B.1 and B.2 in Bilodeau et al. 2022).
For any and ,
and for any that is conditionally benign and ,
To get our new anytime regret bound for , we need to further condition on which happens with probability at least :
Lemma C.2.
For any and ,
Proof of Lemma C.2.
Define
Then and it is easy to find that is a martingale sequence with respect to . To see this,
Also,
Then by the Azuma–Hoeffding inequality,
and we get after taking a union bound over . ∎
C.2 Anytime High-probability Regret Bound
Now we provide our high-probability regret bounds for and that will lead to Proposition 3.4.
Theorem C.3.
In any environment , the regret of is bounded by
for all , conditioning on event which happens with probability at least .
Proof of Theorem C.3.
In event , we have that for all . Hence conditioned on , the regret of up to any round satisfies
where we use throughout to simplify our notation. ∎
Theorem C.4.
In any conditionally benign environment , the regret of is bounded by
for all , conditioning on event which happens with probability at least .
Proof of Theorem C.4.
Similarly in event we have for all . Additionally,
where is the action played by . Therefore we can control the cumulative regret of in the first rounds as follows
where in the last inequality we use the same argument as in the proof of Theorem C.3, and the remaining summation term can be controlled by immediately after we further condition on . Therefore, we get
in event . ∎
Combining Theorem C.3 with Theorem C.4 and taking , we thus verify Proposition 3.4.
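For concreteness, the two index rules from Appendix C.1 can be sketched as follows. This is a minimal sketch under our own assumptions: rewards are taken to lie in the unit interval, the confidence width is a generic Hoeffding-style bonus rather than the exact one used in the analysis, and the function names and the treatment of unvisited entries are hypothetical. The standard UCB rule is optimistic about each action directly, whereas the C-UCB rule of Lu et al. (2020) is optimistic about each context value and averages these indices under the marginal of the context given the action.

```python
import numpy as np

def upper_bounds(sums, counts, delta):
    """Optimistic indices: empirical mean plus a Hoeffding-style width (assumed form)."""
    sums = np.asarray(sums, dtype=float)
    counts = np.asarray(counts, dtype=float)
    safe = np.maximum(counts, 1.0)  # avoids division by zero for unvisited entries
    visited = sums / safe + np.sqrt(2.0 * np.log(1.0 / delta) / safe)
    # Unvisited entries get a finite index that is at least as large as
    # any visited index, assuming rewards in [0, 1].
    unvisited = 1.0 + np.sqrt(2.0 * np.log(1.0 / delta))
    return np.where(counts > 0, visited, unvisited)

def ucb_action(action_sums, action_counts, delta):
    """Standard UCB: maximize the optimistic index of each action."""
    return int(np.argmax(upper_bounds(action_sums, action_counts, delta)))

def c_ucb_action(marginals, context_sums, context_counts, delta):
    """C-UCB: average the optimistic context indices under each action's marginal."""
    return int(np.argmax(marginals @ upper_bounds(context_sums, context_counts, delta)))

# Two actions, two context values; row a is the context marginal of action a.
marginals = np.array([[0.9, 0.1], [0.1, 0.9]])
chosen_by_c_ucb = c_ucb_action(marginals, context_sums=[10, 90],
                               context_counts=[100, 100], delta=0.05)
chosen_by_ucb = ucb_action(action_sums=[20, 0], action_counts=[50, 0], delta=0.05)
print(chosen_by_c_ucb, chosen_by_ucb)
```

In the usage example, C-UCB prefers the action whose marginal concentrates on the context with the higher empirical reward, while UCB first tries the action it has not yet played.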
Appendix D Proofs of Lower Bounds
In this section we give the full proofs of Theorem 3.7 and Theorem 5.2. Note that our proof of Theorem 3.7 largely follows, and substantially generalizes, that of Bilodeau et al. (2022, Theorem 6.2).
D.1 Proof of Theorem 3.7
Proof of Theorem 3.7.
Fix and . Let be an arbitrary proper subset of and . Fix to be chosen later. Define the family of marginals for all instances appearing in this proof
where probability is evenly spread within and respectively. Then define a conditionally benign environment by
where is a Bernoulli conditional distribution such that
Now we define some non-benign instances. For every , define by
where is a Bernoulli conditional distribution such that
For any MAB algorithm , let be the actual policy implemented by when it interacts with and . Then by the divergence decomposition formula and the Bretagnolle–Huber inequality,
where in the last step we use for . Combined with the worst-case regret upper bound , it implies that
Noting that , we have
So there exist absolute constants such that whenever , the choice of satisfies and
which completes the proof. ∎
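The Bretagnolle–Huber inequality used in the proof states that for any two distributions P and Q on the same space and any event A, P(A) + Q(Aᶜ) ≥ ½ exp(−KL(P‖Q)). As a quick sanity check (an illustration only, not part of the proof), the following sketch verifies the inequality numerically for pairs of Bernoulli distributions; the grid of parameters and the function names are our own choices.

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def smallest_error_sum(p, q):
    """Minimum of P(A) + Q(A^c) over all events A, for P = Bernoulli(p), Q = Bernoulli(q)."""
    # The events on {0, 1} are: empty set or whole space (sum 1), {1}, and {0}.
    return min(1.0, p + (1 - q), (1 - p) + q)

grid = [i / 20 for i in range(1, 20)]
slack = min(
    smallest_error_sum(p, q) - 0.5 * math.exp(-kl_bernoulli(p, q))
    for p in grid
    for q in grid
)
print(slack)  # nonnegative if the inequality holds on the whole grid
```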
D.2 Proof of Theorem 5.2
Proof of Theorem 5.2.
Fix and . Let be an arbitrary proper subset of and . Fix to be chosen later. For all conditionally benign instances in this proof, we consider to be the Bernoulli distribution given by
which implies that contexts from are more rewarding than those from .
Now define conditionally benign environments , through their marginals
where
where probability is evenly spread within and . So clearly action 1 is the only optimal action in and action is the only optimal action in , with sub-optimality gap .
Fix an algorithm , and let be the actual policy implemented by . By the divergence decomposition formula (Bilodeau et al. 2022), we have that for every ,
By the Bretagnolle–Huber inequality,
Now we pick which implies that . Also for . So
Taking , we know that for some absolute constant , which yields the claim. ∎
In the above proof, it is easy to see that and are close, where for some absolute constant . So for any algorithm that receives as input when interacting with , we have that and are always close, yet the algorithm incurs regret in some instance from .
Appendix E Instances where incurs linear regret
In this section we give an example for to illustrate that to merely force linear regret on a causal bandit algorithm, we need to construct non-benign instances carefully and re-code the algorithm to ensure its erratic behavior in those instances. In particular, we construct a non-benign environment for every such that while the re-coded never plays the optimal arm.
Proposition E.1.
Suppose we modify Algorithm 2 such that, in each phase, we always choose an exact G-optimal design whenever feasible. For any and with and , there exists a non-benign environment such that , while Algorithm 2 will never play the optimal arm, hence incurring linear regret,
Proof of Proposition E.1.
For any and with , suppose we index the contexts in an arbitrary way such that , and we pick arms from and denote them by . Construct marginals as follows:
where we write marginal distributions over as vectors in according to context indices. Then define conditional distributions :
and
In other words, playing arm yields context and deterministic reward, while we could observe or with equal probability and always get the optimal reward by playing arm . So the only optimal arm for is with . We can treat all other as dummy actions by identifying each of them with one of arbitrarily.
Next we will verify the following facts. (1) When no action is eliminated and , any exact G-optimal design does not have positive mass over . (2) Whenever any action is eliminated in the end of phase , it must be that all actions except for are eliminated as well. Then PE would just play till the end. Combining these two facts we can conclude that PE never picks during the interaction with .
No G-optimal design is supported on .
Recall that any G-optimal design maximizes , where over (Lattimore & Szepesvári, 2020, Theorem 21.1). When , can be computed as
It follows that any maximizing must have , after noting that for such . Moreover, there is only one G-optimal design in this case, which is .
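To illustrate how a G-optimal design can place no mass on some actions, the following sketch approximates a G-optimal design with the Frank–Wolfe iteration described in Lattimore & Szepesvári (2020, Chapter 21). It is only an illustration under our own assumptions: the feature matrix is a toy example rather than the instance constructed in this proof, and the iteration returns an approximate design, whereas Proposition E.1 assumes an exact one. By the Kiefer–Wolfowitz theorem, a design is G-optimal exactly when the largest value of the squared norm of a feature vector under the inverse design matrix equals the dimension.

```python
import numpy as np

def g_optimal_design(features, iters=2000, tol=1e-9):
    """Frank-Wolfe iteration for an approximate G-optimal design.

    features: array of shape (k, d) whose rows span R^d (assumed).
    Returns the design and the largest squared norm under the inverse design matrix.
    """
    k, d = features.shape

    def squared_norms(pi):
        V = features.T @ (pi[:, None] * features)  # design matrix V(pi)
        return np.einsum("ij,jk,ik->i", features, np.linalg.inv(V), features)

    pi = np.full(k, 1.0 / k)  # start from the uniform design
    for _ in range(iters):
        norms = squared_norms(pi)
        a = int(np.argmax(norms))
        g = norms[a]
        if g <= d + tol:  # Kiefer-Wolfowitz optimality condition reached
            break
        gamma = (g / d - 1.0) / (g - 1.0)  # step size maximizing log det V
        pi = (1.0 - gamma) * pi
        pi[a] += gamma
    return pi, float(squared_norms(pi).max())

# Toy example: three basis vectors and their average. The average is never
# selected by the iteration, so its mass shrinks toward zero.
features = np.vstack([np.eye(3), np.full((1, 3), 1.0 / 3.0)])
design, g_value = g_optimal_design(features)
print(np.round(design, 3), round(g_value, 3))
```

In this toy example the limiting design is uniform over the three basis vectors, so an action can be excluded from the support even though nothing has been observed about its reward.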
All actions other than would be eliminated at the same time.
If the first elimination happens at the end of phase , then we must have because and rewards are deterministic. So is for all and for . Then the elimination must happen within , and thus all of them are eliminated simultaneously. ∎