Causal Contextual Bandits with Adaptive Context
Abstract
We study a variant of causal contextual bandits where the context is chosen based on an initial intervention chosen by the learner. At the beginning of each round, the learner selects an initial action, depending on which a stochastic context is revealed by the environment. Following this, the learner selects a final action and receives a reward. Given a fixed number of rounds of interaction with the environment, the objective of the learner is to learn a policy (of selecting the initial and the final action) with maximum expected reward. In this paper we study the specific situation where every action corresponds to intervening on a node in some known causal graph. We extend prior work from the deterministic context setting to obtain simple regret minimization guarantees. This is achieved through an instance-dependent causal parameter that characterizes our upper bound. Furthermore, we prove that our simple regret is essentially tight for a large class of instances. A key feature of our work is that we use convex optimization to address the bandit exploration problem. We also conduct experiments to validate our theoretical results, and release our code at the project GitHub Repository.
1 Introduction
Recent years have seen an active interest in causal bandits from the research community (Lattimore et al., 2016; Sen et al., 2017a; b; Lee & Bareinboim, 2018; Yabe et al., 2018; Lee & Bareinboim, 2019; Lu et al., 2020; Nair et al., 2021; Lu et al., 2021; 2022; Maiti et al., 2022; Varici et al., 2022; Subramanian & Ravindran, 2022; Xiong & Chen, 2023). In this setting, one assumes an environment comprising causal variables, that is, random variables that influence each other as per a given causal (directed and acyclic) graph. Specifically, the edges in the causal DAG represent causal relationships between variables in the environment. If one of these variables is designated as a reward variable, then the goal of a learner is to maximize their reward by intervening on certain variables (i.e., by fixing the values of certain variables). The rest of the variables, which are not intervened upon, take values as per their conditional distributions, given their parents in the causal graph. In this work, as is common in the literature, we assume that the variables take values in {0, 1}. Of particular interest are causal settings wherein the learner is allowed to perform atomic interventions. Here, at most one causal variable can be set to a particular value, while other variables take values in accordance with their underlying distributions.
It is relevant to note that when a learner performs an intervention in a causal graph, they get to observe the values of multiple other variables in the causal graph. Hence, the collective dependence of the reward on the variables is observed through each intervention. That is, from such an observation, the learner may be able to make inferences about the (expected) reward under other values for the causal variables (Peters et al., 2017). In essence, with a single intervention, the learner is allowed to intervene on a variable (in the causal graph), allowed to observe all other variables, and further, is privy to the effects of such an intervention. Indeed, such an observation in a causal graph is richer than a usual sample from a stochastic process. Hence, a standard goal in causal bandits is to understand the power and limitations of interventions. This goal manifests in the form of developing algorithms that identify intervention(s) that lead to high rewards, while using as few observations/interventions as possible. We use the term intervention complexity (rather than sample complexity) for our algorithm, to emphasize that interventions are richer than samples.
![Refer to caption](x1.png)
In the learning literature, there are several objectives that an algorithm designer might consider. Cumulative regret, simple regret, and average regret have been prominently studied in the literature (Lattimore & Szepesvári, 2020; Slivkins et al., 2019). In this work we focus on minimizing simple regret, wherein the algorithm is given a time budget during which it may explore, and after which it has to output a near-optimal policy.
Addressing causal bandits, the notable work of Lattimore et al. (2016) obtains an intervention-complexity bound for minimizing simple regret with a focus on atomic interventions and parallel causal graphs. Maiti et al. (2022) extend this work to obtain intervention-complexity bounds for simple regret in causal graphs with unobserved variables. The work by Lu et al. (2022) extends this setting to causal Markov decision processes (MDPs), while addressing the cumulative regret objective. Combinatorial causal bandits have been studied by Feng & Chen (2023) and Xiong & Chen (2023).
Causal contextual bandits have been studied by Subramanian & Ravindran (2022) where the contexts may be chosen by the learner (rather than be provided by the environment). Here we generalize Subramanian & Ravindran (2022) to a setting where the context is provided by the environment, adaptively, in response to an initial choice of the learner.
Motivating Example: Consider an advertiser looking to post ads on a web-page, say Amazon. They may make requests for a certain type of user demographic to Amazon. Based on this initial request, the platform may actually choose one particular user to show the ad to. At this time, certain details about the user are revealed to the advertiser. For example, the platform may reveal some of the user demographics, as well as certain details about their device. Based on these details, the advertiser may choose one particular ad to show the user. In case the user clicks the ad, the advertiser receives a reward. The goal of the learner is to find optimal choices for initial user preference, as well as ad-content such that user clicks are maximized. We illustrate this example through Figure 1 where we indicate the choices available for template and content interventions.
1.1 Our Contributions
We develop an algorithm to identify near-optimal interventions in causal bandits with adaptive context, and show that the simple regret of such an algorithm is indeed tight for several instances. We highlight the main contributions of our work below.
1. We develop and analyze an algorithm for minimizing simple regret for causal bandits with adaptive context in an intervention efficient manner. We provide an upper-bound on intervention complexity in Theorem 1.
2. Interestingly, the intervention complexity of our algorithm depends on an instance dependent structural parameter—referred to as (see equation (3))— which may be much lower than , where is the number of interventions and is the number of contexts.
3. Notably, our algorithm uses a convex program to identify optimal interventions. Unlike prior work that uses optimization to design exploration (for example see Yabe et al. (2018)), we show (in Appendix Section E) that the optimization problem we design is convex, and is thus computationally efficient. Using convex optimization to design efficient exploration is in fact a distinguishing feature of our work.
4. We provide lower bound guarantees showing that our regret guarantee is tight (up to a log factor) for a large family of instances (see Section 4 and Appendix Section F).
5. We demonstrate using experiments (see Section 5) that our algorithm performs exceedingly well as compared to other baselines. We note that this is because for causal variables and contexts.
In conclusion, we provide a novel convex-optimization based algorithm for Causal MDP exploration. We analyze the algorithm to come up with an instance dependent parameter . Further, we prove that our algorithm is sample efficient (see Theorems 1 and 2).
1.2 Additional Related Work
Description | Reference |
---|---|
Simple regret for bandits with parallel causal graphs | Lattimore et al. (2016) |
Simple regret for atomic soft interventions | Sen et al. (2017a) |
Simple regret for non-atomic interventions in causal bandits | Yabe et al. (2018) |
Cumulative regret for general causal graphs | Lu et al. (2020) |
Simple regret in the presence of unobserved confounders | Maiti et al. (2022) |
Cumulative regret for unknown causal graph structure | Lu et al. (2021) |
Cumulative regret for causal contextual bandits with latent confounders | Sen et al. (2017b) |
Simple and cumulative regret for budgeted causal bandits | Nair et al. (2021) |
Cumulative regret for Linear SEMs | Varici et al. (2022) |
Cumulative regret for combinatorial causal bandits | Feng & Chen (2023) |
Cumulative regret for Causal MDPs | Lu et al. (2022) |
Best-intervention for combinatorial causal bandits | Xiong & Chen (2023) |
Additive Causal Bandits with Unknown Graph | Malek et al. (2023) |
Structural Causal Bandits with Unobserved Confounders | Wei et al. (2024) |
Confounded Budgeted Causal Bandits | Jamshidi et al. (2024) |
Cumulative Regret for Causal Bandits with Lipschitz SEMs | Yan et al. (2024) |
Simple regret for causal contextual bandits | Subramanian & Ravindran (2022) |
Simple regret for causal contextual bandits with adaptive context | Our work |
Ever since the introduction of the causal bandit framework by Lattimore et al. (2016), we have seen multiple works address causal bandits in various degrees of generality and using different modelling assumptions. Sen et al. (2017a) addressed the issue of soft atomic interventions using an importance sampling based approach. Soft interventions in the linear structural equation model (SEM) setting were addressed recently by Varici et al. (2022). Yabe et al. (2018) proposed an optimization based approach for non-atomic interventions. This work was extended by Xiong & Chen (2023) to provide instance dependent regret bounds. They also provide guarantees for binary generalized linear models (BGLMs). The question of unknown causal graph structure was addressed by Lu et al. (2021), whereas Nair et al. (2021) study the case where interventions are more expensive than observations.
Maiti et al. (2022) addressed simple regret for graphs containing hidden confounding causal variables, while cumulative regret in general causal graphs was addressed by Lu et al. (2020). A notable work by Lu et al. (2022) formulates the framework for causal MDPs, and they provide cumulative regret guarantees in this setting. Causal contextual bandits were addressed by Subramanian & Ravindran (2022); Sen et al. (2017b), and we extend these works to adaptive contexts.
We summarize the main works in this thread in Table 1 and provide a more detailed set of related works in Appendix A.
2 Notations and Preliminaries
![Refer to caption](x2.png)
![Refer to caption](x3.png)
We model the causal contextual bandit with adaptive context as a contextual bandit problem with a causal graph corresponding to each context. The actions at each context are given by interventions on the causal graph. Additionally, we have a causal graph at the start state, and the context is stochastically dependent on the intervention on the causal graph at the start state. For ease of notation, we refer to the start state of the learner as context 0. The agent starts at context 0, chooses an intervention, then transitions to one of contexts , chooses another intervention, and then receives a reward; see Figure 2(a).
Assumptions on the Causal Graph: Formally, let be the set of contexts . Then, at each context, there is a Causal Bayesian Network (CBN) represented by a causal graph; see Figure 2(b). In particular, at each context , the causal graph is composed of variables . Each takes values from , with an associated conditional probability (of being equal to 0 or 1), given the other variables in the causal graph. We make the following mild assumptions on the causal graph at each context.
1. The distribution of any node conditioned on its parents in the causal graph is a Bernoulli random variable with a fixed parameter.
2. The causal graph at each context is semi-Markovian. This is equivalent to making the following assumptions on the graph. No hidden variable in the graph has a parent. Further, every hidden variable has at most two children, both observable.
3. We transform the causal graph for each context as follows: For every hidden variable with two children, we introduce bidirected edges between them. If no path of bidirected edges exists between an intervenable node and its child, the graph is identifiable, which is a necessary and sufficient condition for estimating the graph’s associated distribution (Tian & Pearl, 2002).
Notation | Explanation |
---|---|
Context | Start state |
Context | Intermediate contexts |
Causal Variables: | |
An atomic intervention of the form , or | |
Set of atomic interventions at context | |
Reward on transition from context | |
Causal observational threshold at context | |
diagonal matrix of values | |
Transition probabilities matrix: | |
Transition threshold | |
Policy, a map from contexts to interventions. | |
i.e. for | |
Expectation of the reward at context given intervention |
Interventions: Furthermore, we are allowed atomic interventions, i.e., we can select at most one variable and set it to either or . We will use to denote the set of atomic interventions available at context ; in particular, for . We note that is an empty intervention that allows all the variables to take values from their underlying conditional distributions. Also, and set the value of variable to and , respectively, while leaving all the other variables to independently draw values from their respective distributions. Note that for all , we have . Write .
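As a small illustration of the intervention set described above, the following sketch enumerates the empty intervention together with the two value-setting interventions per variable. The representation (`None` for the empty intervention, a pair `(j, v)` for setting the variable with index `j` to `v`) and the function name `atomic_interventions` are hypothetical choices made for this example only.

```python
from typing import List, Optional, Tuple

# Hypothetical encoding: None is the empty (do-nothing) intervention,
# (j, v) means "set variable j to value v" with v in {0, 1}.
Intervention = Optional[Tuple[int, int]]


def atomic_interventions(n: int) -> List[Intervention]:
    """Enumerate the atomic interventions available on n binary variables."""
    interventions: List[Intervention] = [None]  # empty intervention first
    for j in range(n):
        interventions.append((j, 0))  # set variable j to 0
        interventions.append((j, 1))  # set variable j to 1
    return interventions


actions = atomic_interventions(3)
```

With `n` variables this yields one empty intervention plus two per variable.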
Reward: The environment provides the learner with a reward upon choosing an intervention at context , which we denote as . Note that is a stochastic function of variables . In particular, for all and each realization , the reward is distributed as .
Given such conditional probabilities, we will write to denote the expected value of reward when intervention is performed at context . Here the expectation is over the parents of the variable in the causal graph, with the intervened variable set at the required value. Note that these parents (of ) may in turn have conditional distributions given their parents. The leaf nodes of the causal graph are considered to have unconditional Bernoulli distributions. For instance, is the expected reward when variable is set to , and all the other variables independently draw values from their respective (conditional) distributions. Indeed, the goal of this work is to develop an algorithm that maximizes the expected reward at context .
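To make the expectation above concrete, here is a minimal sketch that computes the expected reward under an atomic intervention by exact enumeration. It assumes the simplest possible graph, in which the variables are mutually independent Bernoulli parents of the reward (a parallel graph); the paper's setting is more general. The names `expected_reward`, `q`, and `reward_prob` are illustrative and not taken from the paper.

```python
import itertools


def expected_reward(q, reward_prob, intervention=None):
    """Expected reward in an assumed parallel graph.

    q[j]          : P(X_j = 1) when X_j is not intervened upon.
    reward_prob(x): P(reward = 1 | X = x) for a tuple x of 0/1 values.
    intervention  : None, or (j, v) meaning X_j is forced to value v.
    """
    n = len(q)
    total = 0.0
    for x in itertools.product((0, 1), repeat=n):
        # Skip assignments inconsistent with the intervention.
        if intervention is not None and x[intervention[0]] != intervention[1]:
            continue
        weight = 1.0
        for j, xj in enumerate(x):
            if intervention is not None and j == intervention[0]:
                continue  # the intervened variable is fixed, not random
            weight *= q[j] if xj == 1 else 1.0 - q[j]
        total += weight * reward_prob(x)
    return total


q = [0.5, 0.2]
reward_prob = lambda x: 0.9 if x[0] == 1 else 0.1
```

In this toy instance the reward depends only on the first variable, so forcing it to 1 raises the expected reward while intervening on the second variable leaves it unchanged.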
Causal Observational Threshold: We denote by , the causal observational threshold from Maiti et al. (2022) at context . (Footnote 1: Maiti et al. (2022) extend the causal observational threshold from Lattimore et al. (2016) to the general setting of causal graphs with unobserved confounders.) This is computed as follows. Let . Further, let be sets parameterized by for every , where indicates the c-component size. Then . The existence of such a threshold at each context is guaranteed by the assumptions we made on the CBNs. In addition, let denote the diagonal matrix of .
Transitions at Context 0: At context 0, the transition to the intermediate contexts stochastically depends on the random variables . Here, denotes the probability of transitioning into context with atomic intervention ; recall that includes the do-nothing intervention. We will collectively denote these transition probabilities as matrix . Furthermore, write the transition threshold to denote the minimum non-zero value in . Note that matrix is fixed, but unknown.
Policy: A map , between contexts and interventions (performed by the algorithm), will be referred to as a policy. Specifically, is the intervention at context . Note that, for any policy , the expected reward, which we denote as , is equal to . Maximizing expected reward, at each intermediate context , we obtain the overall optimal policy as follows. For :
(1) | ||||
(2) |
Our goal then is to find a policy with (expected) reward as close to that of as possible.
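The two-stage maximization described above can be sketched as follows, assuming the transition probabilities and the expected rewards are known. The layout (`P[a, i]` for the probability of reaching intermediate context `i` after start intervention `a`, and `mu[i, b]` for the expected reward of intervention `b` at context `i`) and the function name `optimal_policy` are assumptions of this example.

```python
import numpy as np


def optimal_policy(P: np.ndarray, mu: np.ndarray):
    """Return (best start intervention, best intervention per context, value).

    P[a, i] : probability of reaching intermediate context i after
              performing intervention a at the start state.
    mu[i, b]: expected reward of intervention b at intermediate context i.
    """
    best_b = mu.argmax(axis=1)      # best intervention at each context
    best_mu = mu.max(axis=1)        # its expected reward
    start_values = P @ best_mu      # expected reward of each start intervention
    best_a = int(start_values.argmax())
    return best_a, best_b, float(start_values[best_a])


P = np.array([[0.5, 0.5], [0.9, 0.1]])
mu = np.array([[0.2, 0.8], [0.6, 0.3]])
a_star, b_star, value = optimal_policy(P, mu)
```

The start intervention is chosen only after the intermediate contexts have been solved, since its value depends on the best reward attainable at each context it may lead to.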
Simple Regret: Conforming to the standard simple-regret framework, the algorithm is given a time budget , i.e., the learner can go through the following process times — (a) start at context . (b) Choose an intervention . (c) Transition to context . (d) Choose an intervention . (e) Receive reward . At the end of these steps, the goal of the learner is to compute a policy. Let the policy returned by the learner be . Then the simple regret is defined as the expected value: . Our algorithm seeks to minimize such a simple regret.
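The interaction protocol and the regret of a returned policy can be simulated with a short sketch. It abstracts away the causal graphs and assumes each (context, intervention) pair yields a Bernoulli reward with a known mean, which is a simplification made only for illustration. All function names are hypothetical. Note that the simple regret in the text is an expectation over the learner's randomness, whereas `simple_regret` below evaluates the gap for one fixed returned policy.

```python
import numpy as np


def run_round(P, mu, a, choose_b, rng):
    """One round: intervene at the start, observe a context, intervene, get reward."""
    i = rng.choice(P.shape[1], p=P[a])   # stochastic transition to a context
    b = choose_b(i)                      # learner picks the second intervention
    reward = rng.binomial(1, mu[i, b])   # Bernoulli reward (assumed model)
    return i, b, reward


def policy_value(P, mu, a, b_per_context):
    """Expected reward of the policy (a at the start, b_per_context[i] at context i)."""
    contexts = np.arange(mu.shape[0])
    return float(P[a] @ mu[contexts, b_per_context])


def simple_regret(P, mu, a_hat, b_hat):
    """Gap between the optimal expected reward and that of the returned policy."""
    best = float((P @ mu.max(axis=1)).max())
    return best - policy_value(P, mu, a_hat, b_hat)


P = np.array([[0.5, 0.5], [0.9, 0.1]])
mu = np.array([[0.2, 0.8], [0.6, 0.3]])
rng = np.random.default_rng(0)
i, b, r = run_round(P, mu, 0, lambda ctx: 0, rng)
```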
3 Main Algorithm and its Analysis
We now provide the details relating to our main Algorithm, viz. ConvExplore.
The algorithm can be described by five main steps. In the first step, we estimate the transitions to intermediate contexts. In the second step, we estimate the causal observational thresholds at these contexts. In the third step, we estimate the rewards upon doing interventions at these contexts. With good reward estimates and transition probability estimates, the computation of a good policy at the intermediate contexts (step 4) and at the start state (step 5) is straightforward. This Algorithm relies on three subroutines which are detailed in Section B of the Appendix. The key aspect of this algorithm is in designing the exploration of interventions (at the start state and at the intermediate contexts) to be regret-optimal – i.e. trading off exploration time between different interventions such that the policy eventually obtained has near-optimal reward.
Our algorithm (ConvExplore) uses subroutines to estimate the transition probabilities, the causal parameters, and the rewards. From these, it outputs the best available interventions as its policy . Given time budget , the algorithm uses the first rounds to estimate the transition probabilities (i.e., the matrix ) in Algorithm 2. The subsequent rounds are utilized in Algorithm 3 to estimate causal parameters s. Finally, the remaining budget is used in Algorithm 4 to estimate the intervention-dependent reward s, for all intermediate contexts .
To judiciously explore the interventions at context , ConvExplore computes frequency vectors . In such vectors, the th component denotes the fraction of time that each intervention is performed by the algorithm, i.e., given time budget , the intervention will be performed times. Note that, by definition, and the frequency vectors are computed by solving convex programs over the estimates. The algorithm and its subroutines throughout consider empirical estimates, i.e., find the estimates by direct counting. Here, let denote the computed estimate of the matrix and be the estimate of the diagonal matrix . We obtain a regret upper bound via an optimal frequency vector (see Step 4 in ConvExplore).
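The two mechanical ingredients mentioned above, estimation by direct counting and turning a frequency vector into a number of pulls per intervention, can be sketched as below. The function names are hypothetical, and rounding down with a floor is an assumption of this example, since the text does not specify how fractional counts are handled.

```python
import numpy as np


def estimate_transitions(samples, num_actions, num_contexts):
    """Empirical transition matrix from (start intervention, context) pairs."""
    counts = np.zeros((num_actions, num_contexts))
    for a, i in samples:
        counts[a, i] += 1
    totals = counts.sum(axis=1, keepdims=True)
    # Rows with no samples are left as zeros rather than guessed.
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)


def pulls_from_frequencies(f, budget):
    """Number of times each intervention is performed (assumed floor rounding)."""
    f = np.asarray(f, dtype=float)
    return np.floor(f * budget).astype(int)


samples = [(0, 0), (0, 1), (0, 1), (1, 0)]
P_hat = estimate_transitions(samples, num_actions=3, num_contexts=2)
pulls = pulls_from_frequencies([0.5, 0.25, 0.25], budget=8)
```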
Recall that for any vector (with non-negative components), the Hadamard exponentiation leads to the vector wherein for each component . We next define a key parameter that specifies the regret bound in Theorem 1 (below). At a high-level, parameter captures the “exploration efficacy” in the MDP, that takes into account the transition probabilities and the exploration requirements at the intermediate layer. Identification of this parameter is a relevant technical contribution of our work; see Section C.1 for a detailed derivation of .
(3) |
Furthermore, we will write to denote the optimal frequency vector in equation (3). Hence, with vector , we have . Note that Step 4 in ConvExplore addresses an analogous optimization problem, albeit with the estimates and . Also, we show in Lemma 11 (see Section E in the supplementary material) that this optimization problem is convex and, hence, Step 4 admits an efficient implementation.
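For readers less familiar with the notation, Hadamard exponentiation simply raises each component of a vector to the given power. A minimal NumPy sketch follows; the function name `hadamard_power` is illustrative, and the non-negativity check mirrors the restriction stated in the text.

```python
import numpy as np


def hadamard_power(x, p):
    """Component-wise power of a vector with non-negative entries."""
    x = np.asarray(x, dtype=float)
    if np.any(x < 0):
        raise ValueError("Hadamard exponentiation expects non-negative components")
    return np.power(x, p)


roots = hadamard_power([4.0, 9.0], 0.5)
inverse_roots = hadamard_power([4.0, 0.25], -0.5)
```

With a negative exponent, a zero component produces an infinite entry, so a caller would typically keep the components strictly positive in that case.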
To understand the behaviour of , we first note that whenever the values at the contexts are low, the value is low. Specifically, the values can go as low as (when the s are all ), removing the dependence of on . The upper-bound on is . We see this by first upper-bounding each by . Then, note that whenever , then such that where . Now we can compute that , and thereby ; see footnote 2. (Footnote 2: We show detailed algorithms for estimation of transition probabilities (line 2), estimation of causal observational threshold (line 3), and estimation of rewards (line 4) in Appendix B.)
The following theorem that upper bounds the regret of ConvExplore is the main result of the current work. The result requires the algorithm’s time budget to be at least
Theorem 1.
Given number of rounds and as in equation (3), ConvExplore achieves regret
Observe that is independent of the number of contexts and interventions. Therefore dominates when number of interventions at an intermediate context is large.
4 Analysis of the Lower Bound
Since ConvExplore solves an optimization problem, it is a priori unclear whether a different algorithm could achieve a better regret guarantee than Theorem 1. In this section, we show that for a large class of instances, the regret guarantee we provide is indeed optimal. We provide a lower bound on regret for a family of instances. For any number of contexts , we show that there exist transition matrices and reward distributions () such that the regret achieved by ConvExplore (Theorem 1) is tight, up to log factors.
Theorem 2.
For any corresponding to causal variables at contexts , there exists a transition matrix , and probabilities corresponding to causal variables , and reward distributions, such that the simple regret achieved by any algorithm is
5 Experiments
We first list a few baseline algorithms that we compare ConvExplore with. This is followed by a complete description of our experimental setup. Finally, we present and discuss our main results.
Uniform Exploration: This algorithm uniformly explores the interventions in the instance. It first performs all the atomic interventions at the start state in a round robin manner. On transitioning to any context , it performs atomic interventions in a round robin manner. UnifExplore achieves a regret upper bounded by , which is also the optimal lower bound for non-causal algorithms. Hence it serves as a good comparison as it achieves an optimal non-causal simple regret. We plot the comparison with this non-causal regret optimal exploration in Figure 3. We plot the regret with respect to (A) the number of rounds of exploration and (B) the values of our instance. Notice that at extremely high values ConvExplore does not perform well, as such an instance does not particularly benefit from the causal structure. Even so, with further tuning of constants in our Algorithm, we should achieve a performance similar to UnifExplore.
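A minimal sketch of round-robin exploration in the spirit of the uniform baseline described above is given below. It is not the authors' implementation: it ignores the causal structure entirely, models rewards as Bernoulli draws with means `mu[i, b]`, and uses hypothetical names throughout.

```python
import numpy as np


def unif_explore(P, mu, budget, rng):
    """Round-robin at the start state and at each visited context.

    Returns the empirically best start intervention and the empirically
    best intervention at every intermediate context.
    """
    num_a, k = P.shape
    num_b = mu.shape[1]
    trans_counts = np.zeros((num_a, k))
    reward_sum = np.zeros((k, num_b))
    pulls = np.zeros((k, num_b))
    next_b = np.zeros(k, dtype=int)       # round-robin pointer per context
    for t in range(budget):
        a = t % num_a                     # round-robin at the start state
        i = rng.choice(k, p=P[a])         # environment picks the context
        b = next_b[i]
        next_b[i] = (b + 1) % num_b
        r = rng.binomial(1, mu[i, b])
        trans_counts[a, i] += 1
        reward_sum[i, b] += r
        pulls[i, b] += 1
    mu_hat = np.divide(reward_sum, pulls,
                       out=np.zeros_like(reward_sum), where=pulls > 0)
    row_tot = trans_counts.sum(axis=1, keepdims=True)
    P_hat = np.divide(trans_counts, row_tot,
                      out=np.zeros_like(trans_counts), where=row_tot > 0)
    b_hat = mu_hat.argmax(axis=1)
    a_hat = int((P_hat @ mu_hat.max(axis=1)).argmax())
    return a_hat, b_hat


# Deterministic toy instance: start intervention 0 always leads to context 0,
# where intervention 1 always pays off.
P = np.array([[1.0, 0.0], [0.0, 1.0]])
mu = np.array([[0.0, 1.0], [0.0, 0.0]])
a_hat, b_hat = unif_explore(P, mu, budget=40, rng=np.random.default_rng(0))
```

Because every intervention receives the same share of the budget regardless of how informative it is, this kind of exploration cannot exploit side observations from the causal graph.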
![Refer to caption](x4.png)
Other Baselines: We now consider several other baselines for comparison that have been used in the literature. Primary amongst these are: (1) UCB at the start state as well as the intermediate contexts; (2) Thompson sampling at the start state as well as the intermediate contexts; (3) round-robin at the start state, and UCB at the intermediate contexts; (4) round-robin at the start state, and Thompson sampling at the intermediate contexts; and (5) UnifExplore, which is round-robin at both the start state and the intermediate contexts.
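For reference, the arm-selection rule of the standard UCB1 algorithm (Auer et al., 2002) can be sketched as follows. The text does not state which UCB variant or constants were used in the experiments, so the exploration bonus below is an assumption, and the function name `ucb1_select` is hypothetical.

```python
import math
import numpy as np


def ucb1_select(counts, reward_sums, t):
    """Pick an arm by the UCB1 index: empirical mean + sqrt(2 ln t / n).

    counts[a]     : number of times arm a has been pulled so far.
    reward_sums[a]: total reward collected from arm a.
    t             : current round number (t >= 1).
    """
    counts = np.asarray(counts, dtype=float)
    reward_sums = np.asarray(reward_sums, dtype=float)
    unpulled = np.flatnonzero(counts == 0)
    if unpulled.size > 0:
        return int(unpulled[0])           # pull every arm once first
    means = reward_sums / counts
    bonus = np.sqrt(2.0 * math.log(t) / counts)
    return int(np.argmax(means + bonus))
```

Such an index rule targets cumulative regret, concentrating pulls on arms that look good early, which differs from the pure-exploration objective studied here.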
Setup: We consider intermediate contexts and a causal graph with variables ( interventions) at each context. The rewards are distributed Bernoulli() for intervention and Bernoulli() otherwise where in the experiments. We set . As in experiments in prior work, we set for and otherwise. Let here. At state , on taking action , we transition uniformly to one of the intermediate contexts. On taking action , we transition with probability to context and probability to any of the other contexts.
We perform two experiments in this setting. In the first one, we run ConvExplore and UnifExplore for time horizon . In the second experiment, we run ConvExplore and UnifExplore for a fixed time horizon with varying in the set . To vary , we vary for the intermediate contexts in the set . We average the regret over runs for each setting. We use CVXPY (Diamond & Boyd, 2016) to solve the convex program at Step 4 in ConvExplore. We release our code in its entirety in our anonymized GitHub project repository, for the community to use and improve.
![Refer to caption](x5.png)
![Refer to caption](x6.png)
Results of comparison with UnifExplore: In Figure 3a, we compare the expected simple regret of ConvExplore vs. UnifExplore. Our plots indicate that ConvExplore outperforms UnifExplore and its regret falls rapidly as increases. In Figure 3b, we plot the expected simple regret against for ConvExplore and UnifExplore that was obtained in Experiment , and empirically validate their relationship that was proved in Theorem 1.
![Refer to caption](x7.png)
Results of comparison with other baselines: We find that ConvExplore significantly outperforms baselines other than UnifExplore. Specifically, Thompson sampling and UCB are not well tuned to the exploration problem, and hence perform poorly on both metrics: (1) simple regret and (2) probability of finding the best intervention. A mixture of round-robin at the start state with these alternatives at the intermediate contexts also performs poorly relative to ConvExplore for this particular exploration problem. In Figure 4 we plot the metrics against the exploration budget. In Figure 5 we plot the metrics of interest against the number of contexts at the intermediate stage. Finally, in Figure 6, we plot the simple regret as well as the probability of finding the best intervention against our parameter , while keeping the number of intermediate contexts the same. The results of these experiments and full details can be found here.
6 Conclusions
We studied extensions of the causal contextual bandits framework to include adaptive context choice. This is an important problem in practice and the solutions therein have immediate practical applications. The setting of stochastic transition to a context accounted for non-trivial extensions from Subramanian & Ravindran (2022) who studied targeted interventions. We developed a Convex Exploration algorithm for minimizing simple regret under this setting. Furthermore, while Maiti et al. (2022) studied the simple causal bandit setting with unobserved confounders, our work addresses causal contextual bandits with adaptive contexts, under the same constraint of allowing unobserved confounders (assuming identifiability). We identified an instance dependent parameter , and proved that the regret of this algorithm is . The current work also established that, for certain families of instances, this upper bound is essentially tight. Finally, we showed through experiments that our algorithm performs better than uniform exploration in a range of settings. We believe our method of converting the exploration in the causal contextual bandit setting is novel, and may have implications outside the causal setting as well.
Possible generalizations of this work include extensions to non-binary reward settings. Another natural extension would be to derive bounds for L-layered MDPs, extending from the adaptive contextual bandit setting we consider. It would be interesting to see whether that problem reduces to convex exploration as well. Finally, extending convex exploration methods from this paper to other more general simple regret problems may also be a promising avenue for future research.
7 Acknowledgements
Siddharth Barman gratefully acknowledges the support of the Walmart Center for Tech Excellence (CSR WMGT-23-0001) and a SERB Core research grant (CRG/2021/006165).
References
- Acharya et al. (2018) Jayadev Acharya, Arnab Bhattacharyya, Constantinos Daskalakis, and Saravanan Kandasamy. Learning and testing causal models with interventions. Advances in Neural Information Processing Systems, 31, 2018.
- Agrawal & Goyal (2012) Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem. In Conference on learning theory, pp. 39–1. JMLR Workshop and Conference Proceedings, 2012.
- Ali et al. (2005) R Ayesha Ali, Thomas S Richardson, Peter Spirtes, and Jiji Zhang. Towards characterizing markov equivalence classes for directed acyclic graphs with latent variables. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pp. 10–17, 2005.
- Audibert et al. (2010) Jean-Yves Audibert, Sébastien Bubeck, and Rémi Munos. Best arm identification in multi-armed bandits. In COLT, pp. 41–53, 2010.
- Auer et al. (1995) P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, FOCS ’95, pp. 322, USA, 1995. IEEE Computer Society. ISBN 0818671831.
- Auer & Ortner (2010) Peter Auer and Ronald Ortner. Ucb revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65, 2010.
- Auer et al. (2002) Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Mach. Learn., 47(2–3):235–256, may 2002. ISSN 0885-6125. doi: 10.1023/A:1013689704352. URL https://doi.org/10.1023/A:1013689704352.
- Azar et al. (2017) Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning, 2017.
- Bouneffouf et al. (2020) Djallel Bouneffouf, Irina Rish, and Charu Aggarwal. Survey on applications of multi-armed and contextual bandits. In 2020 IEEE Congress on Evolutionary Computation (CEC), pp. 1–8. IEEE, 2020.
- Bubeck et al. (2009) Sebastien Bubeck, Remi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. In Algorithmic Learning Theory: 20th International Conference, ALT 2009, Porto, Portugal, October 3-5, 2009. Proceedings 20, pp. 23–37. Springer, 2009.
- Bubeck et al. (2012) Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
- Canonne et al. (2017) Clément L Canonne, Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. Testing bayesian networks. In Conference on Learning Theory, pp. 370–448. PMLR, 2017.
- Cover & Thomas (2006) Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, USA, 2006. ISBN 0471241954.
- Daskalakis et al. (2019) Constantinos Daskalakis, Nishanth Dikkala, and Gautam Kamath. Testing ising models. IEEE Transactions on Information Theory, 65(11):6829–6852, 2019.
- Devroye (1983) Luc Devroye. The equivalence of weak, strong and complete convergence in l1 for kernel density estimates. The Annals of Statistics, 11(3):896–904, 1983. ISSN 00905364. URL http://www.jstor.org/stable/2240651.
- Diamond & Boyd (2016) Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.
- Eberhardt et al. (2005) Frederick Eberhardt, Clark Glymour, and Richard Scheines. On the number of experiments sufficient and in the worst case necessary to identify all causal relations among n variables. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pp. 178–184, 2005.
- Even-Dar et al. (2006) Eyal Even-Dar, Shie Mannor, Yishay Mansour, and Sridhar Mahadevan. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of machine learning research, 7(6), 2006.
- Feng & Chen (2023) Shi Feng and Wei Chen. Combinatorial causal bandits. Proceedings of the AAAI Conference on Artificial Intelligence, 37(6):7550–7558, Jun. 2023. doi: 10.1609/aaai.v37i6.25917. URL https://ojs.aaai.org/index.php/AAAI/article/view/25917.
- Hauser & Bühlmann (2014) Alain Hauser and Peter Bühlmann. Two optimal strategies for active learning of causal models from interventional data. International Journal of Approximate Reasoning, 55(4):926–939, 2014.
- Jamshidi et al. (2024) Fateme Jamshidi, Jalal Etesami, and Negar Kiyavash. Confounded budgeted causal bandits. arXiv preprint arXiv:2401.07578, 2024.
- Kang & Tian (2006) Changsung Kang and Jin Tian. Inequality constraints in causal models with hidden variables. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, UAI’06, pp. 233–240, Arlington, Virginia, USA, 2006. AUAI Press. ISBN 0974903922.
- Kocaoglu et al. (2017a) Murat Kocaoglu, Alex Dimakis, and Sriram Vishwanath. Cost-optimal learning of causal graphs. In International Conference on Machine Learning, pp. 1875–1884. PMLR, 2017a.
- Kocaoglu et al. (2017b) Murat Kocaoglu, Karthikeyan Shanmugam, and Elias Bareinboim. Experimental design for learning causal graphs with latent variables. Advances in Neural Information Processing Systems, 30, 2017b.
- Kuleshov & Precup (2014) Volodymyr Kuleshov and Doina Precup. Algorithms for multi-armed bandit problems. arXiv preprint arXiv:1402.6028, 2014.
- Lattimore et al. (2016) Finnian Lattimore, Tor Lattimore, and Mark D. Reid. Causal bandits: Learning good interventions via causal inference. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pp. 1189–1197, Red Hook, NY, USA, 2016. Curran Associates Inc. ISBN 9781510838819.
- Lattimore & Szepesvári (2020) Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020. doi: 10.1017/9781108571401.
- Lee & Bareinboim (2018) Sanghack Lee and Elias Bareinboim. Structural causal bandits: Where to intervene? In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/c0a271bc0ecb776a094786474322cb82-Paper.pdf.
- Lee & Bareinboim (2019) Sanghack Lee and Elias Bareinboim. Structural causal bandits with non-manipulable variables. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):4164–4172, Jul. 2019. doi: 10.1609/aaai.v33i01.33014164. URL https://ojs.aaai.org/index.php/AAAI/article/view/4320.
- Lu et al. (2020) Yangyi Lu, Amirhossein Meisami, Ambuj Tewari, and William Yan. Regret analysis of bandit problems with causal background knowledge. In Conference on Uncertainty in Artificial Intelligence, pp. 141–150. PMLR, 2020.
- Lu et al. (2021) Yangyi Lu, Amirhossein Meisami, and Ambuj Tewari. Causal bandits with unknown graph structure. Advances in Neural Information Processing Systems, 34:24817–24828, 2021.
- Lu et al. (2022) Yangyi Lu, Amirhossein Meisami, and Ambuj Tewari. Efficient reinforcement learning with prior causal knowledge. In Conference on Causal Learning and Reasoning, pp. 526–541. PMLR, 2022.
- Maiti et al. (2022) Aurghya Maiti, Vineet Nair, and Gaurav Sinha. A causal bandit approach to learning good atomic interventions in presence of unobserved confounders. In Uncertainty in Artificial Intelligence, pp. 1328–1338. PMLR, 2022.
- Malek et al. (2023) Alan Malek, Virginia Aglietti, and Silvia Chiappa. Additive causal bandits with unknown graph. In International Conference on Machine Learning, pp. 23574–23589. PMLR, 2023.
- Mannor & Tsitsiklis (2004) Shie Mannor and John N Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5(Jun):623–648, 2004.
- Nair et al. (2021) Vineet Nair, Vishakha Patil, and Gaurav Sinha. Budgeted and non-budgeted causal bandits. In Arindam Banerjee and Kenji Fukumizu (eds.), The 24th International Conference on Artificial Intelligence and Statistics, AISTATS 2021, April 13-15, 2021, Virtual Event, volume 130 of Proceedings of Machine Learning Research, pp. 2017–2025. PMLR, 2021. URL http://proceedings.mlr.press/v130/nair21a.html.
- Orabona et al. (2012) Francesco Orabona, Nicolo Cesa-Bianchi, and Claudio Gentile. Beyond logarithmic bounds in online learning. In Artificial intelligence and statistics, pp. 823–831. PMLR, 2012.
- Osband & Van Roy (2016) Ian Osband and Benjamin Van Roy. On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732, 2016.
- Pearl & Verma (1995) Judea Pearl and Thomas S Verma. A theory of inferred causation. In Studies in Logic and the Foundations of Mathematics, volume 134, pp. 789–811. Elsevier, 1995.
- Peters et al. (2017) Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. The MIT Press, 2017. ISBN 0262037319.
- Sen et al. (2017a) Rajat Sen, Karthikeyan Shanmugam, Alexandros G Dimakis, and Sanjay Shakkottai. Identifying best interventions through online importance sampling. In International Conference on Machine Learning, pp. 3057–3066. PMLR, 2017a.
- Sen et al. (2017b) Rajat Sen, Karthikeyan Shanmugam, Murat Kocaoglu, Alex Dimakis, and Sanjay Shakkottai. Contextual bandits with latent confounders: An nmf approach. In Artificial Intelligence and Statistics, pp. 518–527. PMLR, 2017b.
- Shanmugam et al. (2015) Karthikeyan Shanmugam, Murat Kocaoglu, Alexandros G Dimakis, and Sriram Vishwanath. Learning causal graphs with small interventions. Advances in Neural Information Processing Systems, 28, 2015.
- Slivkins et al. (2019) Aleksandrs Slivkins et al. Introduction to multi-armed bandits. Foundations and Trends® in Machine Learning, 12(1-2):1–286, 2019.
- Spirtes et al. (2000) Peter Spirtes, Clark N Glymour, Richard Scheines, and David Heckerman. Causation, prediction, and search. MIT press, 2000.
- Subramanian & Ravindran (2022) Chandrasekar Subramanian and Balaraman Ravindran. Causal contextual bandits with targeted interventions. In International Conference on Learning Representations, 2022.
- Tian & Pearl (2002) Jin Tian and Judea Pearl. On the testable implications of causal models with hidden variables. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pp. 519–527, 2002.
- Varici et al. (2022) Burak Varici, Karthikeyan Shanmugam, Prasanna Sattigeri, and Ali Tajer. Causal bandits for linear structural equation models. arXiv preprint arXiv:2208.12764, 2022.
- Wei et al. (2024) Lai Wei, Muhammad Qasim Elahi, Mahsa Ghasemi, and Murat Kocaoglu. Approximate allocation matching for structural causal bandits with unobserved confounders. Advances in Neural Information Processing Systems, 36, 2024.
- Xiong & Chen (2023) Nuoya Xiong and Wei Chen. Combinatorial pure exploration of causal bandits. In International Conference on Learning Representations, 2023.
- Yabe et al. (2018) Akihiro Yabe, Daisuke Hatano, Hanna Sumita, Shinji Ito, Naonori Kakimura, Takuro Fukunaga, and Ken-ichi Kawarabayashi. Causal bandits with propagating inference. In International Conference on Machine Learning, pp. 5512–5520. PMLR, 2018.
- Yan et al. (2024) Zirui Yan, Dennis Wei, Dmitriy Katz-Rogozhnikov, Prasanna Sattigeri, and Ali Tajer. Causal bandits with general causal models and interventions. arXiv preprint arXiv:2403.00233, 2024.
- Zhang (2008) Jiji Zhang. On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artificial Intelligence, 172(16-17):1873–1896, 2008.
- Zhang (2020) Junzhe Zhang. Designing optimal dynamic treatment regimes: A causal reinforcement learning approach. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 11012–11022. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/zhang20a.html.
Appendix A Related Work
In our work, we draw from prior literature from causality as well as from multi-armed bandits. We will briefly cover these two in the following section.
A.1 Multi-armed bandits:
The stochastic Multi-Armed Bandit (MAB) setup is a standard model for studying the exploration-exploitation trade-off in sequential decision making problems (Kuleshov & Precup, 2014; Bubeck et al., 2012). Such trade-offs arise in several modern applications, such as ad placement, website optimization, recommendation systems, and packet routing (Bouneffouf et al., 2020) and are thus a central part of the theory relating to online learning (Slivkins et al., 2019; Lattimore & Szepesvári, 2020).
Traditional performance measures for MAB algorithms have focused on cumulative regret (Auer et al., 2002; Agrawal & Goyal, 2012; Auer & Ortner, 2010), as well as best-arm identification under the fixed confidence (Even-Dar et al., 2006) and fixed budget (Audibert et al., 2010) settings. Another variant of regret that has been considered is the minimax regret (Azar et al., 2017), which focuses on the worst case over all possible environments. In some settings, however, one may be interested in optimizing the exploration phase alone. As a metric for such pure exploration in MABs, simple regret has been proposed as a natural performance criterion (Bubeck et al., 2009). In this setting, we allow for some period of exploration, after which the learner has to choose an arm. The simple regret is then evaluated as the difference between the average reward of the best arm and the average reward of the learner’s recommendation. We focus on simple regret in this work.
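To make the definition of simple regret concrete, the following is a minimal sketch, assuming a plain stochastic bandit with Bernoulli rewards and a round-robin exploration phase. The function name `simple_regret_uniform` and the arm means are illustrative assumptions and are not part of the algorithms studied in this paper.

```python
import random

def simple_regret_uniform(means, budget, rng):
    """Explore arms round-robin for `budget` pulls, recommend the empirically
    best arm, and return the simple regret of that recommendation."""
    k = len(means)
    totals = [0.0] * k
    counts = [0] * k
    for t in range(budget):
        arm = t % k
        reward = 1.0 if rng.random() < means[arm] else 0.0  # Bernoulli reward
        totals[arm] += reward
        counts[arm] += 1
    estimates = [totals[a] / counts[a] if counts[a] else 0.0 for a in range(k)]
    recommended = max(range(k), key=lambda a: estimates[a])
    # Simple regret: best true mean minus true mean of the recommended arm.
    return max(means) - means[recommended]

rng = random.Random(0)
means = [0.5, 0.55, 0.7]  # hypothetical arm means
avg_regret = sum(simple_regret_uniform(means, 300, rng) for _ in range(200)) / 200
```

Only the quality of the final recommendation is measured; the rewards collected during exploration do not enter the regret.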
Each of these performance metrics comes with its own lower bounds (Orabona et al., 2012; Osband & Van Roy, 2016; Bubeck et al., 2012), which are naturally the benchmarks for any proposed algorithm. The lower bound on simple regret is known to be $\Omega(\sqrt{K/T})$ for a stochastic multi-armed bandit problem with $K$ arms and $T$ rounds. This bound is obtained from the lower bound for pure exploration provided by Mannor & Tsitsiklis (2004).
Note that a naive approach to the causal bandit problem, which simply treats an intervention on each of the exponentially many combinations of nodes as an arm, may thus incur regret that scales exponentially with the number of nodes. We now review some of the literature on causality, which helps in addressing the causal aspects of the problem.
A.2 Causality:
There are three broad threads in causality related to our work. These are causal graph learning, causal testing and causal bandits. We address relevant works in these areas below.
Learning Causal Graphs: Tian & Pearl (2002) laid the groundwork for analysing functional constraints among the distributions of observed variables in causal Bayesian networks. Similarly, Kang & Tian (2006) derive such functional constraints over interventional distributions. These two seminal works led to great interest in the problem of learning causal graphs.
There have been several studies that provide algorithms to recover causal graphs from the conditional independence relations in observational data (Pearl & Verma, 1995; Spirtes et al., 2000; Ali et al., 2005; Zhang, 2008). Subsequent work considered the setting where both observational and interventional data are available (Eberhardt et al., 2005; Hauser & Bühlmann, 2014). Kocaoglu et al. (2017a) extend the causal graph learning problem to a budgeted setting. Shanmugam et al. (2015) use interventions on sets of small size to learn the causal structure. Kocaoglu et al. (2017b) provide an efficient randomized algorithm to learn a causal graph with confounding variables.
Testing over Bayesian networks: Given sample access to an unknown Bayesian Network (Canonne et al., 2017), or Ising model (Daskalakis et al., 2019), one may wish to decide whether an unknown model is equal to a known fixed model, and analyse the sample complexity of this hypothesis test. Acharya et al. (2018) address this question by introducing the concept of covering interventions. These covering interventions allow us to understand the behaviour of multiple interventions (that are covered) simultaneously. We utilize the concept of covering interventions from Acharya et al. (2018) towards our question of finding the optimal intervention in a causal bandit. The area of reinforcement learning over causal bandits has also been studied in Zhang (2020).
Apart from these areas in causality, our primary problem of causal bandits has been addressed by Lattimore et al. (2016); Maiti et al. (2022); Sen et al. (2017a); Lu et al. (2020); Nair et al. (2021); Sen et al. (2017b); Lu et al. (2021; 2022); Varici et al. (2022); Xiong & Chen (2023). We detail these in the Related Work discussion of the main paper (Section 1.2).
Appendix B Algorithms in Detail
In this section, we outline the three algorithms that are used as helpers in ConvExplore. The first, Algorithm 2, is used to estimate the transition probabilities out of context 0 on taking various actions.
Next we estimate the causal parameters at all contexts through Algorithm 3. Then we will use Algorithm 4 to estimate the rewards on various interventions at the intermediate contexts.
For estimating the causal parameters, we use a variant of SRM-ALG from Maiti et al. (2022), which estimates the causal observational threshold , under the setting of unobserved confounders and identifiability. We note that even in the presence of general causal graphs with hidden variables, SRM-ALG is able to efficiently estimate the rewards of all the arms simultaneously using the observational arm pulls. As mentioned in Section 3 of Maiti et al. (2022), the challenge is to identify the optimal number of arms with bad estimates during the initial phase of the algorithm, such that these arms can be intervened upon at the later phase. The parameter is the minimum conditional probability of , given different configurations of the parents of . Once we have these estimates, the remaining algorithm can proceed as per usual.
Note that in Algorithm 4 there are two phases. In the first phase, we carry out estimates for interventions that have high probability of being observed on the intervention. In the second phase, we specifically perform interventions which have not been observed often enough. This is similar to Algorithm 2, where we carry out the two phases of interventions at context 0.
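The two-phase pattern described above can be illustrated with a short sketch. This is not Algorithm 4 itself: it assumes independent binary variables (so that conditioning on an observed value matches intervening on it), and the names `two_phase_estimates`, `q`, `reward_mean` and `threshold` are hypothetical choices made only for this illustration.

```python
import random

def two_phase_estimates(q, reward_mean, budget, threshold, rng):
    """q[i] = P(X_i = 1) with no intervention (assumed independent variables).
    reward_mean(x) = expected reward for a full assignment x (hypothetical).
    Returns (estimates per atomic intervention (i, value), rarely observed arms)."""
    n = len(q)
    arms = [(i, v) for i in range(n) for v in (0, 1)]
    sums = {a: 0.0 for a in arms}
    counts = {a: 0 for a in arms}

    def sample(forced=None):
        x = [1 if rng.random() < q[i] else 0 for i in range(n)]
        if forced is not None:
            x[forced[0]] = forced[1]  # atomic intervention do(X_i = v)
        r = 1.0 if rng.random() < reward_mean(x) else 0.0
        return x, r

    # Phase 1: observe only; every sample informs one arm per variable.
    half = budget // 2
    for _ in range(half):
        x, r = sample()
        for i in range(n):
            sums[(i, x[i])] += r
            counts[(i, x[i])] += 1

    # Phase 2: intervene round-robin on the arms that were rarely observed.
    rare = [a for a in arms if counts[a] < threshold]
    if rare:
        for a in rare:  # discard the few noisy observational samples
            sums[a], counts[a] = 0.0, 0
        for t in range(budget - half):
            a = rare[t % len(rare)]
            _, r = sample(forced=a)
            sums[a] += r
            counts[a] += 1
    return {a: (sums[a] / counts[a] if counts[a] else 0.0) for a in arms}, rare

rng = random.Random(0)
est, rare = two_phase_estimates(
    [0.5, 0.02], lambda x: 0.9 if x[1] == 1 else 0.2, 4000, 200, rng)
```

In this toy run the value X_1 = 1 is seen too rarely in the observational phase, so the second phase spends its budget intervening on it directly.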
Appendix C Proof of Theorem 1
In this section, we restate Theorem 1 and provide its proof, along with all the lemmas that are used in the proof.
Theorem.
Given number of rounds and as in equation (3), ConvExplore achieves regret
C.1 Proof of Theorem 1
To prove the theorem, we analyze the algorithm’s execution as falling under either good event or bad event, and tackle the regret under each.
Definition 1.
We define five events, $E_1$ to $E_5$ (see Table 3), the intersection of which we call the good event, $E$, i.e., $E := \bigcap_{i \in [5]} E_i$. Furthermore, we define the bad event $F := E^c$.
Event | Condition | Explanation |
---|---|---|
$E_1$ | | for every intervention , the empirical estimate of transition probability in each of Algorithms 2, 3 and 4 is good, up to an absolute factor of |
$E_2$ | | our estimate for causal parameter for state 0 is relatively good in Algorithm 2. |
$E_3$ | | our estimate for causal parameter for each context is relatively good in Algorithm 3. |
$E_4$ | | The error in estimated transition probability in Algorithm 2 sums to less than where |
$E_5$ | | The error in reward estimates in Algorithm 4 is bounded by where (see footnote 3) |

Footnote 2: is upper bounded by kn, but is typically significantly smaller (as m may be much smaller than n). If , we can find such that where the interventions in are observed with probability more than and . On each visit to a context , we perform . From these we can estimate values, which may be used to estimate values. In the first half we estimate rewards for the interventions in the first half, and the interventions in in the second half.

Footnote 3: For the interventions , we can estimate in the first half. Based on these interventions at each context , we get estimates of and the intervention sets such that (I) and (II) interventions in are observed with probability less than . Note that we round robin over the interventions across visits in the second half of the algorithm.

Considering the estimates and , along with frequency vector (computed in Step 4; see footnote 2), we define random variable
Note that is a surrogate for . We will show that under the good event, is close to (Lemma 3).
Recall that and here the expectation is with respect to the policy computed by the algorithm. We can further consider the expected sub-optimality of the algorithm and the quality of the estimates (in particular, , and ) under good event (E).
Based on the estimates returned at Step 4 of ConvExplore, either the good event holds, or we have the bad event. We obtain the regret guarantee by first bounding the sub-optimality of policies computed under the good event, and then bounding the probability of the bad event.
Lemma 1.
For the optimal policy , under the good event (), we have
Proof.
We add and subtract and reduce the expression on the left to: .
We have: (a) (as rewards are bounded) (b) (by ) and (c) (by ). The above expression is thus bounded above by Furthermore, it follows from (See Corollary 2 in Section D.1 in the supplementary material) that (component-wise) . Hence, the above-mentioned expression is bounded above by . Note that the definition of ensures . Further, . Hence, , which establishes the lemma. ∎
We now state another similar lemma for any policy computed under good event.
Lemma 2.
Let be a policy computed by ConvExplore under the good event (). Then,
Proof.
We can add and subtract to the expression on the left to get: . Analogous to Lemma 1, one can show that this expression is bounded above by . ∎
We can also bound to within a constant factor of .
Lemma 3.
Under the good event , we have .
Proof.
Event ensures that (see Corollary 2 in Appendix section D.1). In addition, note that event gives us . From these observations we obtain the desired bound: ; here, the first inequality follows from the fact that is the minimizer of the expression, and for the second inequality, we substitute the appropriate bounds of and . ∎
Recall that:
(4) | ||||
(5) |
We will now define , denoting the sub-optimality of a policy , as the difference between the expected rewards of and , i.e., .
Corollary 1.
For any computed by ConvExplore under good event ,
Proof.
Since ConvExplore selects the optimal policy (maximizing rewards with respect to the estimates), . Combining this with Lemmas 1 and 2, we get under good event. The left-hand-side of this expression is equal to . Using Lemma 3, we get that . ∎
Corollary 1 shows that under the good event, the (true) expected reward of and are within of each other. In Lemma 10 (see Section D.5 in the supplementary material) we will show (see footnote 4) that whenever (see footnote 5).

Footnote 4: In the second half, we may intervene on the atomic interventions in for time each.

Footnote 5: Using observations of , we estimate and .
The above-mentioned bounds together establish Theorem 1 (i.e., bound the regret of ConvExplore): . Since the rewards are bounded between and , we have , for all policies . But giving us . Therefore, Corollary 1 along with Lemma 10, leads to guarantee
Appendix D Bounding the Probability of the Bad Event
Recall that the good event corresponds to $E := \bigcap_{i \in [5]} E_i$ (see Definition 1). Write $F := E^c$ and note that, for the regret analysis, we require an upper bound on $\mathbb{P}\{F\}$. Towards this, in this section we address $\mathbb{P}\{E_i^c\}$, for each of the events $E_1$ to $E_5$, and then apply the union bound.
D.1 Bound on $E_1$
The next lemma upper bounds the probability of .
Proof.
On performing any intervention at context , the intermediate context that we visit follows a multinomial distribution. Hence, we can apply Devroye’s inequality (for multinomial distributions) to obtain a concentration guarantee; we state the inequality next in our notation.
Lemma 5 (Restatement of Lemma 3 in Devroye (1983)).
Let be the number of times intervention is performed in context . Then, for any and any , we have
. Here, is the support of the distribution (i.e., the number of contexts that can be reached from with a nonzero probability).
Note that each intervention is performed at least times across Algorithms 2, 3 and 4. Setting and above, we get that for each intervention , in each subroutine, .
Note that to apply the inequality, we require , i.e., . In the current context, the support size is at most ; this follows from the fact that on performing any intervention , at most contexts can have . Hence, the requirement reduces to .
Next, we union bound the probability over the interventions (at state ) and the three subroutines, to obtain that, for any intervention and in any subroutine, .
Note that , for any . Hence, for any , we have . This completes the proof of the lemma. ∎
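Devroye's inequality, as used in the proof above, can be sanity-checked numerically. In generic notation (ours, not the paper's), for $n$ samples from a distribution $p$ with support size $k$ and any $\varepsilon \in (0,1)$ with $n \ge 20k/\varepsilon^2$, the bound reads $\mathbb{P}\{\|\hat{p} - p\|_1 > \varepsilon\} \le 3\exp(-n\varepsilon^2/25)$. The sketch below compares this bound against an empirical frequency; the distribution `p` and the values of `n` and `eps` are hypothetical.

```python
import math
import random

def devroye_bound(n, eps):
    """Right-hand side 3 * exp(-n * eps^2 / 25) of the L1 deviation bound."""
    return 3.0 * math.exp(-n * eps * eps / 25.0)

def l1_deviation(p, n, rng):
    """Draw n samples from the categorical distribution p and
    return the L1 distance between the empirical frequencies and p."""
    counts = [0] * len(p)
    for outcome in rng.choices(range(len(p)), weights=p, k=n):
        counts[outcome] += 1
    return sum(abs(c / n - pi) for c, pi in zip(counts, p))

rng = random.Random(0)
p = [0.5, 0.3, 0.2]   # hypothetical transition distribution over 3 contexts
n, eps = 2000, 0.2    # satisfies n >= 20 * len(p) / eps**2 = 1500
trials = 200
freq = sum(l1_deviation(p, n, rng) > eps for _ in range(trials)) / trials
bound = devroye_bound(n, eps)
```

The bound is loose for such a small support, so the empirical frequency of large deviations is expected to sit well below it.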
We state below a corollary which provides a multiplicative bound on with respect to , complementing the additive form of .
Corollary 2.
Under event , we have , for all interventions and contexts .
Proof.
Event ensures that , for each intervention and context . This, in particular, implies that for each intervention and context the following inequality holds: . Note that if , then the algorithm will never observe context with intervention , i.e., in such a case . For the nonzero s, recall that (by definition), . Therefore, for any nonzero , the above-mentioned inequality gives us . Equivalently, and . Therefore, for all s the corollary holds. ∎
D.2 Bound on Events $E_2$ and $E_3$
In this section, we bound the probabilities that our estimated s are far away from the true causal parameters s.
Lemma 6.
For any , in Algorithm 2, .
Proof.
We allocate time to Algorithm 2. Lemma 8 of Lattimore et al. (2016) ensures that, for any and , we have , with probability at least . Setting , we get the required probability bound. ∎
Next, we address .
Proof.
Fix any reachable context . Corresponding to such a context, there exists an intervention such that . Event (Corollary 2) implies that .
Now, write to denote the number of times context is visited by Algorithms 3 and 4. Recall that in the subroutines we estimate by counting the number of times context was reached and intervention was simultaneously observed. Furthermore, note that we allocate to every intervention at least time (see Step 2 in both subroutines). In particular, intervention was necessarily observed times. Therefore, . This inequality leads to a useful lower bound: .
We now restate Lemma 8 from Lattimore et al. (2016): Let be the number of times context is observed. Then, .
Since , this guarantee of Lattimore et al. (2016) corresponds to .
D.3 Bound on $E_4$:
The following lemma provides an upper bound for .
Lemma 8.
Let . Then, .
Proof.
As in the proof of Lemma 4, we will use Devroye’s inequality. Write to denote the number of times intervention is observed (in state ) in Algorithm 2. For any and with , Devroye’s inequality gives us . Here, is the size of the support of the multinomial distribution.
We first show that is sufficiently large, for each intervention . Recall that we allocate time to Algorithm 2. Furthermore, we observe each intervention in state , at least times, either as part of the do-nothing intervention or explicitly in Step 10 of Algorithm 2. Now, event ensures that . Hence, each intervention is observed times.
Substituting this inequality for in the above-mentioned probability bound, we obtain
when . As observed in Lemma 4, the support size is at most . Therefore, the requirement on reduces to .
Setting gives us
Therefore , and this probability bound requires . That is, . This inequality is satisfied by our choice of . Hence, the lemma stands proved. ∎
D.4 Bound on $E_5$
The next lemma bounds .
Lemma 9.
Let . Then, . In other words:
.
Proof.
For intermediate contexts , we denote the realization of the causal parameters and the transition probabilities in Algorithm 4, as and , respectively. The estimates in the previous subroutines are denoted by and .
Event gives us and . Hence, the estimates across the subroutines are close enough: . Similarly, event gives us .
Write to denote the number of times context was visited in Algorithm 4. For all contexts , we first establish a useful lower bound on , under events and . The relevant observation here is that the estimate was computed in Algorithm 4 by counting the number of times context was visited with intervention (at state ). By construction, in Algorithm 4 each intervention was performed at least times. Furthermore, given that was computed via the visitation count, we get that context is visited with intervention at least times. Therefore, . Here, the last inequality follows from the above-mentioned proximity between and .
Now, note that, at each context , Algorithm 4 (by construction) observes every intervention at least times. Write to denote the number of times intervention is observed in this subroutine. Hence,
(6) |
For each context and intervention , define the event as . Hoeffding’s inequality gives us . The last inequality is obtained by substituting Equation 6.
Recall that . Hence, the previous inequality corresponds to .
Note that . Taking a union bound over all contexts and interventions , we obtain . This completes the proof. ∎
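The Hoeffding-plus-union-bound pattern used in this proof can be sketched numerically. The snippet below assumes rewards in $[0,1]$ and uses the standard two-sided Hoeffding bound $2\exp(-2n\eta^2)$; the numbers of contexts and interventions, the accuracy `eta` and the failure budget `delta` are hypothetical values, not the constants of Lemma 9.

```python
import math

def hoeffding_failure_prob(num_samples, eta):
    """Two-sided Hoeffding bound for the mean of [0, 1]-valued samples."""
    return 2.0 * math.exp(-2.0 * num_samples * eta * eta)

def samples_for_union_bound(num_events, eta, delta):
    """Smallest per-event sample count for which the union bound over
    num_events Hoeffding events is at most delta."""
    return math.ceil(math.log(2.0 * num_events / delta) / (2.0 * eta * eta))

k, N = 5, 21                 # hypothetical numbers of contexts and interventions
eta, delta = 0.05, 0.01      # hypothetical accuracy and failure probability
m = samples_for_union_bound(k * N, eta, delta)
total_failure = k * N * hoeffding_failure_prob(m, eta)
```

The required number of observations per (context, intervention) pair grows only logarithmically in the number of pairs, which is why the union bound over all contexts and interventions is affordable.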
D.5 Bound on bad event (F):
Write .
Lemma 10.
for any .
Appendix E Nature of the Optimization Problem
Proposition E.1.
Let . Then, finding is an LP
Proof.
We rewrite the above as a simpler program:
subject to | |||
Where . This is equivalent to the standard form of a linear program, and hence is an LP. ∎
Lemma 11.
is a convex optimization problem
Proof.
First we write the - in terms of a single minimization. First let us use the shorthand and (where ) denote the rows of the matrix
subject to | ||||
(7) | ||||
Proposition E.2.
For any , the function is convex in .
Proof.
We observe that the second derivative is positive. ∎
Proposition E.3.
The constraint equations of OPT are convex in
Proof.
Consider the first constraint of the problem. We can simplify this to get .
Note that the th term in the summand (i.e, ) is of the form for some and . Let be any two vectors, and scalar . We wish to show that .
We have
But is convex as per Proposition E.2. Therefore , as required.
Since is convex, the sum is convex as well. Similarly, all the other constraints are also convex. ∎
Since the constraints are convex in and the objective is linear, OPT is convex. ∎
Appendix F Lower Bounds
This section establishes Theorem 2. We will identify a collection of instances for causal bandits with intermediate feedback and show that, for any given algorithm , there exists an instance in this collection for which ’s regret is .
First we describe the collection of instances and then provide the proof.
For any integer , consider causal variables at each context . The transition matrix is set to be deterministic. Specifically, for each , we have . For all other interventions at context 0, we transition to context k with probability 1. Such a transition matrix can be achieved by setting for all . As before, the total number of interventions .
Now consider a family of instances (see footnote 6). Here, and each is an instance of a causal bandit with intermediate feedback with the above-mentioned transition probabilities. The instances differ in the rewards at the intermediate contexts. In particular, in instance , we set the reward distributions such that for all contexts and interventions . For each and , instance differs from only at context and for intervention . Specifically, by construction, we will have , for a parameter . The expected rewards under all other interventions will be , the same as in .

Footnote 6: Note the change in notation. We used the term instead of in the main paper. This has been amended in a later version of the main paper.
Given any algorithm , we will consider the execution of over all the instances in the family. The execution of algorithm over each instance induces a trace, which may include the realized transition probabilities , the realized variable probabilities for and and the corresponding s, and the realized rewards . Each of such realizations (random variables) has a corresponding distribution (over many possible runs of the algorithm). We call the measures corresponding to these random variables under the instances and as and , respectively.
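The structure of this family (one base instance plus one perturbed instance per context and intervention pair) can be sketched as follows. The base expected reward of 0.5, the gap `eps`, and the function name `build_instance_family` are assumptions made for illustration only; the construction above fixes its own values.

```python
def build_instance_family(k, interventions, base_mean=0.5, eps=0.1):
    """Return (base, perturbed). base[(context, a)] is the expected reward,
    equal to base_mean everywhere (assumed value). perturbed[(i, a)] is a copy
    of base whose expected reward is base_mean + eps at (i, a) only."""
    base = {(i, a): base_mean for i in range(1, k + 1) for a in interventions}
    perturbed = {}
    for key in base:
        instance = dict(base)
        instance[key] = base_mean + eps  # the single better (context, intervention)
        perturbed[key] = instance
    return base, perturbed

base, perturbed = build_instance_family(3, ["a1", "a2"])
```

Each perturbed instance is indistinguishable from the base instance unless the algorithm gathers enough samples at the one modified pair, which is the source of the lower bound.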
F.1 Proof of Theorem 2
For any algorithm and given time budget , we first consider the ’s execution over instance . As mentioned previously, denotes the trace distribution induced by the algorithm for . In particular, write to denote the expected number of times context is visited, .
Recall that and , where the Bernoulli probabilities of the variables at context are sorted to satisfy . Note that these definitions do not depend on the algorithm at hand. The algorithm, however, may choose to perform different interventions a different number of times. Write to denote the expected (under ) number of times intervention is performed by the algorithm at context . Furthermore, let random variable denote the number of times intervention is observed at context . Hence, is the expected number of times intervention is observed (see footnote 7).

Footnote 7: Note that can be observed while performing the do-nothing intervention. Also, the expected value accounts for the number of times is explicitly performed and not just observed.
Using the expected values for algorithm and instance , we define a subset of as follows: . The following proposition shows that the size of is sufficiently large.
Proposition F.1.
The set is non-empty. In particular,
Proof.
The upper bound on the size of subset follows directly from its definition: since we have .
For the lower bound on the size of , note that is the expected number of times context is visited by the algorithm. Therefore,
(8) |
Furthermore, by definition, for each intervention we have . Hence, assuming would contradict inequality (8). This observation implies that and, hence, . This completes the proof. ∎
Recall that denotes the number of times intervention is observed at context . The following proposition bounds for each intervention .
Proposition F.2.
For every intervention
Proof.
Any intervention may be observed either when it is explicitly performed by the algorithm or as a random realization (under some other intervention, including do-nothing). Since , the probability that is observed as part of some other intervention is at most . Therefore, the expected number of times that is observed by the algorithm, without explicitly performing it, is at most (here, we use the fact that the realization of is independent of the visitation of context ); recall that the expected number of times context is visited is equal to .
For any intervention , by definition, the expected number of times is performed . Therefore, the proposition follows:
∎
We now state two known results for KL divergence.
Bretagnolle-Huber Inequality (Theorem 14.2 in Lattimore & Szepesvári (2020)): Let $P$ and $Q$ be any two measures on the same measurable space. Let $A$ be any event in the sample space with complement $A^c$. Then,
$$P(A) + Q(A^c) \ge \frac{1}{2}\exp\left(-\mathrm{KL}(P, Q)\right). \tag{9}$$
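As a quick numerical illustration of the Bretagnolle-Huber inequality, $P(A) + Q(A^c) \ge \frac{1}{2}\exp(-\mathrm{KL}(P,Q))$, the sketch below evaluates both sides for Bernoulli reward distributions. The means 0.5 and 0.5 + eps and the sample count `n` are hypothetical values chosen for the example, not the parameters of the construction.

```python
import math

def kl_bernoulli(p, q):
    """KL divergence KL(Ber(p) || Ber(q)) for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def bh_lower_bound(kl):
    """Right-hand side 0.5 * exp(-KL(P, Q)) of the Bretagnolle-Huber inequality."""
    return 0.5 * math.exp(-kl)

def min_error_sum(p, q):
    """Minimum over events A of P(A) + Q(A^c) for one sample,
    with P = Ber(p) and Q = Ber(q)."""
    P = {0: 1 - p, 1: p}
    Q = {0: 1 - q, 1: q}
    events = [set(), {0}, {1}, {0, 1}]
    return min(sum(P[x] for x in A) + sum(Q[x] for x in (0, 1) if x not in A)
               for A in events)

# n independent reward samples from Ber(1/2) (P) versus Ber(1/2 + eps) (Q).
n, eps = 100, 0.05
kl_total = n * kl_bernoulli(0.5, 0.5 + eps)  # KL is additive over independent samples
bound = bh_lower_bound(kl_total)
```

When the two reward means are close and the number of observations is small, the KL divergence stays small, so the bound keeps the sum of the two error probabilities bounded away from zero.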
Bound on KL-Divergence with number of observations (Adaptation of Equation 17 in Lemma B1 from Auer et al. (1995)): Let and be any two measures with differing expected rewards (for exactly the intervention at context ) by an amount . Then,
(10) |
Using this bound on KL divergence and Proposition F.2, we have, for all contexts and interventions :
(11) |
Substituting this in the Bretagnolle-Huber Inequality, we obtain, for any event in the sample space along with all contexts and all interventions :
(12) |
We now define events to lower bound the probability that Algorithm returns a sub-optimal policy. In particular, write to denote the policy returned by algorithm . Note that is a random variable.
For any and any intervention , write to denote the event that—under the returned policy —intervention is not chosen at context , i.e., . Also, let denote the event that policy does not induce a transition to from context , i.e., . Furthermore, write . Note that the complement .
Considering measure , we note that for each context there exists an intervention with the property that . This follows from the fact that . Therefore, for each context there exists an intervention such that .
This bound and inequality 12 imply that for all contexts there exists an intervention that satisfies
(13) |
We will set
(14) |
Therefore takes value either or . We will address these over two separate cases.
Case 1: .
We wish to substitute this value in Equation 13. Towards this, we will state a proposition.
Proposition F.3.
There exists a context such that
Proof.
First, we note the following claim considering all vectors in the probability simplex .
Claim F.1.
For any given set of integers , we have
Proof.
Assume, towards a contradiction, that for all , we have . Then, , for all . Therefore, . However, this is a contradiction as . ∎
Therefore, irrespective of how s are chosen, there always exists a context such that . ∎
For such a context that satisfies Proposition F.3, we note that, or .
Let us now restate Equation 13 for such a context . There exists a context and an intervention that satisfies
(15) |
Note that the last inequality lower bounds the probability of selecting a non-optimal policy when the algorithm is executed on instance . Furthermore, in instance , for any non-optimal policy we have . Therefore, we can lower bound ’s regret over instance as follows:
(16) | ||||
(17) |
Note that we can construct the instances to ensure that , for all contexts , and, hence, (see Proposition F.1). Therefore Equation 17 gives us:
(18) |
Case 2: We now consider the case when . In such a case, .
We showed in Proposition F.3 that there exists a context such that . Combining the two statements, there exists a context such that . We now restate Inequality 13 for such a context :
Following the exact same procedure as in Case 1, we can derive that . We saw in Case 1 that it is possible to construct instances such that . Therefore the following holds for Case 2 also:
(19) |
Inequalities 18 and 19 imply that there exists a context and an intervention such that, under instance , algorithm ’s regret satisfies
(20) |
We complete the proof of Theorem 2 by showing that in the current context .
Proposition F.4.
For the chosen transition matrix
Proof.
Recall that all the instances, and s, have the same (deterministic) transition matrix . Also, parameter is computed via Equation 3.
Consider any frequency vector over the interventions . From the chosen transition matrix, we have the following:
From here, we can compute the following:
That is, for all , the th component of the vector is equal to . All the remaining components are .
Write for all and . Since is a frequency vector, . In addition,
Therefore, by definition, . Now, using a complementary form of Claim F.1 we obtain . This proves the proposition.
∎
Finally, substituting Proposition F.4 into Equation 20, we obtain that there exists an instance for which algorithm ’s regret is lower bounded as follows
(21) |
This completes the proof of Theorem 2.
F.2 Proof of Inequality (10)
For completeness, we provide a proof of inequality (10).
Lemma 12.
Proof of Inequality (10).
This proof is based on Lemma B1 in Auer et al. (1995). We first introduce some notation. Let denote the filtration (of rewards and other observations) up to time , and let denote the reward at time .
We now state (without proof) a useful lemma for bounding the KL divergence between random variables over a number of observations.
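Chain Rule for entropy (Theorem 2.5.1 in Cover & Thomas (2006)): Let $X_1, X_2, \ldots, X_n$ be random variables drawn according to $p(x_1, x_2, \ldots, x_n)$. Then
$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1),$$
where $H$ is the entropy associated with the random variables.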
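The chain rule for entropy can be checked numerically in the two-variable case, where it states that the joint entropy equals the entropy of the first variable plus the conditional entropy of the second given the first. The joint distribution below is an arbitrary illustrative choice, not data from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a finite distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Illustrative joint distribution over (X, Y) with X, Y in {0, 1}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginal distribution of X, obtained by summing out Y.
marginal_x = {}
for (x, _), p in joint.items():
    marginal_x[x] = marginal_x.get(x, 0.0) + p

joint_entropy = entropy(joint.values())
entropy_x = entropy(marginal_x.values())
# H(Y | X) is the average, weighted by P(X = x), of the entropy of Y given X = x.
conditional_entropy = sum(
    px * entropy([joint[(x, y)] / px for y in (0, 1)])
    for x, px in marginal_x.items()
)

print(joint_entropy, entropy_x + conditional_entropy)
```

Both printed values agree up to floating-point error. The same decomposition applied step by step over the rounds is what lets a divergence over a whole interaction history be split into per-round terms.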
Using the chain rule for entropy
Let be the intervention chosen by the algorithm at time . Then:
Since , we get:
Claim F.2.
Proof.
where the last inequality is obtained from the Taylor series expansion of the . ∎
It follows that: . ∎