Order-Optimal Regret in Distributed Kernel Bandits using Uniform Sampling with Shared Randomness

Nikola Pavlovic School of Electrical & Computer Engineering, Cornell University, Ithaca, NY, {np358, qz16}@cornell.edu Sudeep Salgia Carnegie Mellon University, Pittsburgh, PA, [email protected] Qing Zhao School of Electrical & Computer Engineering, Cornell University, Ithaca, NY, {np358, qz16}@cornell.edu

(Feb 2024)

Abstract

We consider distributed kernel bandits where $N$ agents aim to collaboratively maximize an unknown reward function that lies in a reproducing kernel Hilbert space. Each agent sequentially queries the function to obtain noisy observations at the query points. Agents can share information through a central server, with the objective of minimizing regret that is accumulating over time $T$ and aggregating over agents. We develop the first algorithm that achieves the optimal regret order (as defined by centralized learning) with a communication cost that is sublinear in both $N$ and $T$ . The key features of the proposed algorithm are the uniform exploration at the local agents and shared randomness with the central server. Working together with the sparse approximation of the GP model, these two key components make it possible to preserve the learning rate of the centralized setting at a diminishing rate of communication.

1 Introduction

1.1 Distributed Kernel Bandits

We study the problem of zeroth-order online stochastic optimization in a distributed setting, where $N$ agents aim to collaboratively maximize a reward function with communications facilitated by a central server. The reward function $f:\mathcal{X}\rightarrow\mathbb{R}$ is unknown; it is only known that it lives in a Reproducing Kernel Hilbert Space (RKHS) associated with a known kernel $k$ . Each agent sequentially chooses points in the function domain $\mathcal{X}$ to query and subsequently receives noisy feedback on the function values (i.e., random rewards) at the query points. The goal is for each distributed agent to converge quickly to $x^{*}\in\operatorname*{arg\,max}_{x\in\mathcal{X}}f(x)$ , a global maximizer of $f$ . We quantify this goal as minimizing the cumulative regret summed over a learning horizon of length $T$ and over all $N$ agents:

$$R=\sum_{n=1}^{N}\sum_{t=1}^{T}\left(f(x^{*})-f(x^{(n)}_{t})\right), \tag{1}$$

where $x^{(n)}_{t}$ denotes the point queried by agent $n$ at time $t$ .
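As a concrete reading of (1), the following minimal sketch evaluates the cumulative regret for a toy one-dimensional reward function. The function `cumulative_regret`, the quadratic `f`, and the query lists are illustrative assumptions and are not part of the problem setup.

```python
def cumulative_regret(f, x_star, queries):
    """Sum f(x*) - f(x_t^(n)) over all agents n and rounds t.

    queries[n][t] is the point queried by agent n at time t.
    """
    best = f(x_star)
    return sum(best - f(x) for agent_queries in queries for x in agent_queries)


# Toy reward (an assumption): f(x) = -(x - 0.5)^2 on [0, 1], maximized at x* = 0.5.
f = lambda x: -(x - 0.5) ** 2

# N = 2 agents, T = 2 rounds each.
queries = [[0.0, 0.5], [1.0, 0.5]]
regret = cumulative_regret(f, 0.5, queries)
```

In an actual run $f$ is unknown to the agents, so the regret can only be evaluated in simulation or bounded analytically.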

The above zeroth-order stochastic optimization problem can be viewed as a continuum-armed kernelized-bandit problem (Srinivas et al., 2010). The expressive power of the RKHS model represents a broad family of objective functions. In particular, it is known that the RKHS of typical kernels, such as the Matérn family of kernels, can approximate almost all continuous functions on compact subsets of $\mathbb{R}^{d}$ (Srinivas et al., 2010). The problem has been studied extensively under a centralized setting with a single decision maker (i.e., $N=1$ ), for which several algorithms have been proposed, including UCB-based algorithms (Srinivas et al., 2010; Chowdhury and Gopalan, 2017; Abbasi-Yadkori et al., 2011), batched pure exploration (Li and Scarlett, 2022), tree-based domain shrinking (Salgia et al., 2021) and RIPS (Camilleri et al., 2021). Optimal learning efficiency in terms of regret order in $T$ has been obtained in both the stochastic (Li and Scarlett, 2022; Salgia et al., 2021) and the contextual setting (Valko et al., 2013).

In addition to learning efficiency, distributed kernel bandits face a new challenge of communication efficiency. Without constraints on the communication overhead, all agents can share their local observations and coordinate their individual query actions at no cost. The distributed problem can be trivially reduced to a centralized one. At the other end of the spectrum is a complete decoupling of the agents, resulting in $N$ independent single-user problems without the benefit of data sharing for accelerated learning. The tension between learning efficiency (which demands data sharing and action coordination) and communication efficiency is evident. A central question to this trade-off is how to achieve the optimal learning rate enjoyed by the centralized setting using a minimum amount of message exchange among agents.

In contrast to the extensive literature on centralized kernel bandits, distributed kernel bandits are much less explored despite their broad applications (e.g., federated learning for hyperparameter tuning (Dai et al., 2020) and collaborative training of neural nets using the recent theory of Neural Tangent Kernel (Jacot et al., 2018)). There exist only a handful of studies under drastically different settings and constraints (see Sec. 1.3). For the setting considered in this work, no distributed learning algorithm exists that achieves the optimal regret order with a sublinear (in both $T$ and $N$) message exchange among agents.

1.2 Main Results

In this paper, we develop the first algorithm for distributed kernel bandits that achieves the optimal order of regret enjoyed by centralized learning with a sublinear message exchange in both $T$ and $N$ .

To tackle the essential tradeoff between learning rate and communication efficiency, a distributed learning algorithm needs a communication strategy that governs what to communicate and how to integrate the shared information into local query actions. To minimize the total regret that is accumulating over time and aggregating over the agents, the communication strategy needs to work in tandem with the query actions to ensure a continual flow of information available at all agents for decision-making.

A natural answer to what to communicate in a distributed learning problem is certain sufficient local statistics of the underlying unknown parameters. For example, for multi-armed (i.e., discrete arms) and linear bandits, this corresponds to the local estimates of the arm mean values and the mean reward vector respectively. However, for kernel bandits, the corresponding quantity would be an estimate of the function, which is potentially infinite-dimensional and hence an impractical choice for communication. Existing studies resolve this issue by exchanging local query actions and observations across all agents and throughout the learning horizon (Li et al., 2022; Dubey and Pentland, 2020), resulting in a communication cost growing linearly in both $N$ and $T$ .

Even with a communication cost growing linearly in both $N$ and $T$ , preserving the full learning power of a centralized decision maker with $NT$ query points is not immediate. The prevailing approaches to centralized kernel bandits that achieve order optimal regret build on the maximum posterior variance (MPV) sampling strategy (Li and Scarlett, 2022) which queries, at each time, the point with the highest posterior variance conditioned on all past observations. Ensuring such a maximal uncertainty reduction at each query point is believed to be crucial in utilizing the full statistical power of all query points. Unfortunately, such a fully adaptive query strategy is incompatible with the parallel learning among distributed agents. To emulate the MPV sampling at each of the $NT$ query points would require the agents to take turns in their queries and share the local observations immediately with all other agents, an infeasible strategy for most distributed learning problems. Implementing MPV-based sampling in parallel across agents, however, loses the full adaptivity. This is arguably the main obstacle in realizing the optimal learning rate of a centralized kernel bandit in a distributed setting.

To tackle the above challenges, our proposed algorithm represents major departures from the prevailing approaches. Referred to as DUETS (Distributed Uniform Exploration of Trimmed Sets), this algorithm has two key features: uniform exploration at the local agents and shared randomness with the central server.

In DUETS, each agent employs uniform (at random) sampling as the query strategy. Uniform sampling is fully compatible with parallel learning. In particular, note that the union of the local sets of $t$ query points obtained at the $N$ agents through uniform sampling is identical (in distribution) to the set of $Nt$ query points obtained at a centralized decision maker using the same uniform sampling strategy. This superposition property of uniform sampling allows us to leverage the recent results on random exploration in centralized kernel bandits (Salgia et al., 2023a), and is crucial in achieving the optimal learning rate defined by the centralized setting. In addition to preserving the learning rate of the centralized setting, uniform sampling enjoys advantages in both computation and communication. Compared with the MPV strategy, which requires an expensive maximization of a non-convex acquisition function to find each query point, uniform sampling is extremely simple to implement. This computational efficiency can be particularly attractive to distributed local devices. In terms of communication efficiency, uniform sampling makes it possible to bypass the exchange of query points altogether and reduce the exchange of reward observations through the shared randomness strategy detailed below.

In DUETS, each agent has access to an independent coin, i.e., a source of randomness, which is unknown to the other agents but is known to the server. The shared randomness enables the server to reproduce the points queried by the agents, thereby resulting in effective transmission of the local set of queried points at each agent to the server at no communication cost. To reduce the communication overhead associated with the reward observations, we employ sparse approximation of GP models (Wild et al., 2021). The availability of all the queried points at the server provides the perfect platform for leveraging the power of sparse approximation to reduce the communication to a diminishing fraction of the total number of observations. Specifically, the server, with access to all the query points, selects a small subset of points that can approximate, to sufficient accuracy, the posterior statistics corresponding to all the points queried by the agents. This allows a diminishing rate of communication to share local reward observations. It is this integration of uniform sampling, shared randomness, and sparse approximation in DUETS that makes it possible to achieve the optimal learning rate of the centralized setting at a communication cost that is sublinear in both $N$ and $T$ .

We analyze the performance of DUETS and establish that it incurs a cumulative regret of $\widetilde{\mathcal{O}}(\sqrt{NT\gamma_{NT}}\log(T/\delta))$ with probability $1-\delta$, where the notation $\widetilde{\mathcal{O}}(\cdot)$ hides poly-logarithmic factors and $\gamma_{NT}$ denotes the maximal information gain of the kernel, which represents the effective dimension of the kernel. Note that this matches the lower bound (up to logarithmic factors) for any centralized algorithm with a total of $NT$ queries as established in Scarlett et al. (2017), thereby establishing the order-optimality of the proposed algorithm. To the best knowledge of the authors, this is the first algorithm to achieve the optimal order of regret for the problem of distributed kernel bandits. We also establish a bound of $\widetilde{\mathcal{O}}(\gamma_{NT})$ on the communication cost incurred by DUETS, where communication cost is measured by the number of real numbers transmitted during the algorithm (see Section 2 for more details). This significantly improves over the state-of-the-art of $\mathcal{O}(N\gamma_{NT}^{3})$ achieved by the ApproxDisKernelUCB algorithm proposed by Li et al. (2022) and is always guaranteed to be sublinear in the total number of queries, $NT$.

1.3 Related Work

The existing literature on distributed kernel bandits is relatively slim. The most relevant to our work is that by Li et al. (2022), where the authors consider the problem of distributed contextual kernel bandits and propose a UCB-based policy with sparse approximation of GP models and intermittent communication. Their proposed policy was shown to incur a cumulative regret of $\widetilde{\mathcal{O}}(\sqrt{NT}\gamma_{NT})$ and a communication cost of $\mathcal{O}(N\gamma_{NT}^{3})$. The DUETS algorithm proposed in this work offers an improvement over the algorithm in Li et al. (2022) in terms of both regret and communication cost. While the contextual setting with varying arm action sets considered in their work is more general than the setting with a fixed arm set considered in this work, their proposed algorithm does not offer a non-trivial reduction in regret or communication cost in the fixed arm setting. Moreover, neither the regret nor the communication cost incurred by the algorithm in Li et al. (2022) is guaranteed to be sublinear in the total number of queries, $NT$, for all kernels. Consequently, their algorithm does not guarantee convergence to $x^{*}$ or a non-trivial communication cost for all kernels. On the other hand, both the regret and the communication cost of DUETS are guaranteed to be sublinear, implying both convergence and communication efficiency.

Among other studies, Du et al. (2023) consider the problem of distributed pure exploration in kernel bandits over a finite action set, where they focus on designing learning strategies with low simple regret. In this work, we consider the more challenging continuum-armed setup with a focus on minimizing cumulative regret as opposed to simple regret. Another line of work explores the impact of heterogeneity among clients and designs algorithms to minimize this impact. Salgia et al. (2023b) consider personalized kernel bandits in which agents have heterogeneous models and aim to optimize the weighted sum of their own reward function and the average reward function over all the agents. Dubey and Pentland (2020) consider heterogeneous distributed kernel bandits over a graph in which they use additional kernel-based modeling to measure task similarity across different agents.

In contrast to distributed kernel bandits, the problems of distributed multi-armed bandits and linear bandits have been extensively studied. For distributed multi-armed bandits (MAB), a variety of algorithms have been proposed for distributed learning under different network topologies (Landgren et al., 2017; Shahrampour et al., 2017; Sankararaman et al., 2019; Chawla et al., 2020; Zhu et al., 2021). Shi et al. (2021) and Shi and Shen (2021) have analyzed the impact of heterogeneity among agents in the distributed MAB problem. Similarly, the problem of distributed linear bandits is also well understood in a variety of settings with different network topologies (Korda et al., 2016), heterogeneity among agents (Mitra et al., 2021; Ghosh et al., 2021; Hanna et al., 2022) and communication constraints (Mitra et al., 2022; Wang et al., 2019; Huang et al., 2021; Amani et al., 2022; Salgia and Zhao, 2023).

2 Problem Formulation

We consider a distributed learning framework consisting of $N$ agents indexed by $\{1,2,\dots,N\}$. Under this framework, we study the problem of collaboratively maximizing an unknown function $f:\mathcal{X}\to\mathbb{R}$, where $\mathcal{X}\subset\mathbb{R}^{d}$ is a compact, convex set. The function $f$ belongs to the Reproducing Kernel Hilbert Space (RKHS), $\mathcal{H}_{k}$, associated with a known positive definite kernel $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$. The RKHS, $\mathcal{H}_{k}$, is a Hilbert space endowed with an inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}_{k}}$ that obeys the reproducing property, i.e., $\langle g,k(x,\cdot)\rangle_{\mathcal{H}_{k}}=g(x)$ for all $g\in\mathcal{H}_{k}$, and induces the norm $\|g\|_{\mathcal{H}_{k}}=\sqrt{\langle g,g\rangle_{\mathcal{H}_{k}}}$.

The agents can access the unknown function by querying the function at different points in the domain $\mathcal{X}$ . Upon querying a point $x\in\mathcal{X}$ , the agent receives a reward $y=f(x)+\epsilon$ , where $\epsilon$ is a noise term. We make the following assumptions on the unknown function $f$ and noise.

Assumption 2.1.

The RKHS norm of the function $f$ is bounded by a known constant $B$ , i.e., $\|f\|_{\mathcal{H}_{k}}\leq B$ .

Assumption 2.2.

The noise term $\epsilon$ is assumed to be independent across all agents and all queries and is a zero-mean, $R$-sub-Gaussian random variable, i.e., it satisfies the relation $\mathbb{E}[\exp(\lambda\epsilon)]\leq\exp\left(\frac{\lambda^{2}R^{2}}{2}\right)$ for all $\lambda\in\mathbb{R}$.

Assumption 2.3.

For each $r\in\mathbb{N}$, there exists a discretization $\mathcal{U}_{r}$ of $\mathcal{X}$ with $|\mathcal{U}_{r}|=\mathrm{poly}(r)$ such that, for any $f\in\mathcal{H}_{k}$, we have $|f(x)-f([x]_{\mathcal{U}_{r}})|\leq\frac{\|f\|_{\mathcal{H}_{k}}}{r}$, where $[x]_{\mathcal{U}_{r}}=\operatorname*{arg\,min}_{x^{\prime}\in\mathcal{U}_{r}}\|x-x^{\prime}\|_{2}$. Here, the notation $g(x)=\mathrm{poly}(x)$ is equivalent to $g(x)=\mathcal{O}(x^{k})$ for some $k\in\mathbb{N}$.

Assumption 2.4.

Let $\mathcal{L}_{\eta}=\{x\in\mathcal{X}\,|\,f(x)\geq\eta\}$ denote the level set of $f$ for $\eta\in[-B,B]$. We assume that for all $\eta\in[-B,B]$, $\mathcal{L}_{\eta}$ is a disjoint union of at most $M_{f}<\infty$ components, each of which is closed and connected. Moreover, for each such component, there exists a bi-Lipschitzian map between the component and $\mathcal{X}$ with normalized Lipschitz constant pair $L_{f},L_{f}^{\prime}<\infty$.

Assumptions 2.1-2.3 are standard, mild assumptions that are commonly adopted in the literature (Srinivas et al., 2010; Chowdhury and Gopalan, 2017; Li and Scarlett, 2022; Vakili et al., 2022, 2021a). The existence of the discretization $\mathcal{U}_{r}$ in Assumption 2.3 has been justified and adopted in previous studies (Srinivas et al., 2010; Vakili et al., 2021a). In particular, popular kernels such as the Squared Exponential and Matérn kernels are known to be Lipschitz continuous, in which case an $\varepsilon$-cover of the domain with $\varepsilon=\mathcal{O}(1/r)$ is sufficient to show the existence of such a discretization. At a high level, Assumption 2.4 ensures that the structure of the level sets of $f$ satisfies a mild regularity condition. We require this assumption on $f$ in order to adopt a result from Salgia et al. (2023a) in our analysis.

The agents collaborate with each other by communicating through a central server. At each time instant, each agent can send a message to the server through the uplink channel. Based on the messages from different agents received by the server, it can then broadcast a message back to all the agents through the downlink channel.

Our objective is to design a distributed learning policy $\pi$ that specifies, for each agent $n$, the point $x_{t}^{(n)}$ to be queried at each time instant $t$, based on the information available at that agent up to time instant $t$. The performance of a collaborative learning policy $\pi$ is measured in terms of both learning and communication efficiency over a learning horizon of $T$ steps. The learning efficiency is measured using the notion of cumulative regret, as defined in (1).

The communication efficiency is measured using the sum of the uplink and downlink communication costs. In particular, let $C_{\mathrm{up}}^{(n)}(T)$ denote the number of real numbers sent by the agent $n$ to the server over the time horizon. The uplink cost of $\pi$ , $C_{\mathrm{up}}^{\pi}(T)$ is then given as the average communication cost over all agents:

$$C_{\mathrm{up}}^{\pi}(T)=\frac{1}{N}\sum_{n=1}^{N}C_{\mathrm{up}}^{(n)}(T). \tag{2}$$

Similarly, the downlink cost of $\pi$ , $C_{\mathrm{down}}^{\pi}(T)$ is given as the number of real numbers broadcast by the server over the entire time horizon averaged over all agents . The overall communication cost of $\pi$ , $C^{\pi}(T)$ , is given as $C^{\pi}(T)=C_{\mathrm{up}}^{\pi}(T)+C_{\mathrm{down}}^{\pi}(T)$ .

The objective is to design a distributed learning policy that achieves the order-optimal cumulative regret and incurs a low communication cost. We aim to provide high probability bounds on both the cumulative regret and communication cost that hold with probability $1-\delta$ for any given $\delta\in(0,1)$ .

We next give an overview of Gaussian Process models and their sparse approximation, both of which are central to our proposed policy.

2.1 GP Models

In this section, we present a brief overview of Gaussian Process models and their application to establishing confidence intervals for RKHS elements.

A Gaussian Process (GP) is a random process $G$ indexed by $\mathcal{X}$ and is associated with a mean function $\mu:\mathcal{X}\to\mathbb{R}$ and a positive definite kernel $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$. The random process $G$ is defined such that for all finite subsets of $\mathcal{X}$, $\{x_{1},x_{2},\dots,x_{m}\}\subset\mathcal{X}$, $m\in\mathbb{N}$, the random vector $[G(x_{1}),G(x_{2}),\dots,G(x_{m})]^{\top}$ follows a multivariate Gaussian distribution with mean vector $[\mu(x_{1}),\dots,\mu(x_{m})]^{\top}$ and covariance matrix $\Sigma=[k(x_{i},x_{j})]_{i,j=1}^{m}$. Throughout the work, we consider GPs with $\mu\equiv 0$. When used as a prior for a data generating process under Gaussian noise, the conjugate property provides closed form expressions for the posterior mean and covariance of the GP model. Specifically, given a set of observations $\{\mathbf{X}_{m},\mathbf{Y}_{m}\}=\{(x_{i},y_{i})\}_{i=1}^{m}$ from the underlying process, the expressions for the posterior mean and variance of the GP model are given as follows:

$$\mu_{m}(x)=k_{\mathbf{X}_{m}}(x)^{\top}(\lambda\mathbf{I}_{m}+\mathbf{K}_{\mathbf{X}_{m},\mathbf{X}_{m}})^{-1}\mathbf{Y}_{m}, \tag{3}$$

$$\sigma^{2}_{m}(x)=k(x,x)-k_{\mathbf{X}_{m}}(x)^{\top}(\lambda\mathbf{I}_{m}+\mathbf{K}_{\mathbf{X}_{m},\mathbf{X}_{m}})^{-1}k_{\mathbf{X}_{m}}(x). \tag{4}$$

In the above expressions, $k_{\mathbf{X}_{m}}(x)=[k(x_{1},x),k(x_{2},x),\dots,k(x_{m},x)]^{\top}$, $\mathbf{K}_{\mathbf{X}_{m},\mathbf{X}_{m}}=[k(x_{i},x_{j})]_{i,j=1}^{m}$, $\mathbf{I}_{m}$ is the $m\times m$ identity matrix and $\lambda$ corresponds to the variance of the Gaussian noise.
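The expressions (3) and (4) translate directly into a few lines of linear algebra. The sketch below is illustrative only: the squared-exponential kernel, its lengthscale, the value of $\lambda$ and the synthetic data are assumptions chosen for the example, and the helper names `se_kernel` and `gp_posterior` are hypothetical.

```python
import numpy as np


def se_kernel(A, B, lengthscale=0.2):
    """Squared-exponential kernel matrix between the rows of A and the rows of B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dist / lengthscale**2)


def gp_posterior(X, Y, x_test, lam=0.1, kernel=se_kernel):
    """Posterior mean (3) and variance (4) at each row of x_test."""
    k_x = kernel(X, x_test)                    # column i is k_X(x) for test point i
    A = kernel(X, X) + lam * np.eye(len(X))    # lambda * I_m + K_{X,X}
    mean = k_x.T @ np.linalg.solve(A, Y)
    prior_var = kernel(x_test, x_test).diagonal()
    var = prior_var - np.einsum("ij,ij->j", k_x, np.linalg.solve(A, k_x))
    return mean, var


# Synthetic data (an assumption): 20 noisy samples of a smooth function on [0, 1].
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(20, 1))
Y = np.sin(6.0 * X[:, 0]) + 0.1 * rng.standard_normal(20)
grid = np.linspace(0.0, 1.0, 50)[:, None]
mean, var = gp_posterior(X, Y, grid)
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse, which is usually preferable numerically.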

Following a standard approach in the literature (Srinivas et al., 2010), we model the data corresponding to observations from the unknown $f$, which belongs to the RKHS of a positive definite kernel $k$, using a GP with the same covariance kernel $k$. In particular, we assume a fictitious GP prior over the fixed, unknown function $f$ along with a fictitious Gaussian distribution for the noise. The benefit of this approach is that the posterior mean and variance of this GP model serve as tools to both predict the values of the function $f$ and quantify the uncertainty of the prediction at unseen points in the domain, as shown by the following lemma.

Lemma 2.5.

(Vakili et al., 2021a, Thm. 1) Assume that Assumptions 2.1 and 2.2 hold. Given a set of observations $\{\mathbf{X}_{m},\mathbf{Y}_{m}\}$ as described above, such that the query points $\mathbf{X}_{m}$ are chosen independently of the noise sequence, the following relation holds for any fixed $x\in\mathcal{X}$ with probability at least $1-\delta$:

$$|f(x)-\mu_{m}(x)|\leq\beta(\delta)\cdot\sigma_{m}(x),$$

where $\beta(\delta)=B+R\sqrt{\frac{2}{\lambda}\log{\left(\frac{2}{\delta}\right)}}$ .

We emphasize that the GP prior and the Gaussian noise distribution are modeling devices used as part of the algorithm and are not part of the problem setup. In particular, the function $f$ is a fixed, deterministic function in $\mathcal{H}_{k}$ and the noise is $R$-sub-Gaussian.
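To give a feel for the confidence width in Lemma 2.5, the following sketch evaluates $\beta(\delta)=B+R\sqrt{\frac{2}{\lambda}\log\left(\frac{2}{\delta}\right)}$. The function name `confidence_multiplier` and the numerical values of $B$, $R$ and $\lambda$ are hypothetical choices for illustration.

```python
import math


def confidence_multiplier(delta, B, R, lam):
    """beta(delta) = B + R * sqrt((2 / lam) * log(2 / delta)) from Lemma 2.5."""
    return B + R * math.sqrt((2.0 / lam) * math.log(2.0 / delta))


# Hypothetical constants: RKHS norm bound B, sub-Gaussian parameter R, noise variance lam.
B, R, lam = 1.0, 0.1, 0.1
beta_loose = confidence_multiplier(0.1, B, R, lam)    # 90% confidence
beta_tight = confidence_multiplier(0.01, B, R, lam)   # 99% confidence
```

The multiplier grows only as $\sqrt{\log(1/\delta)}$, so demanding a much smaller failure probability widens the interval $\mu_{m}(x)\pm\beta(\delta)\sigma_{m}(x)$ only mildly.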

Lastly, given a set of points $\mathbf{X}_{m}=\{x_{1},x_{2},\dots,x_{m}\}\subset\mathcal{X}$, the information gain of the set $\mathbf{X}_{m}$ is defined as $\gamma_{\mathbf{X}_{m}}:=\frac{1}{2}\log(\det(\mathbf{I}_{m}+\lambda^{-1}\mathbf{K}_{\mathbf{X}_{m},\mathbf{X}_{m}}))$. Using this, we can define the maximal information gain of a kernel as $\gamma_{m}:=\sup_{\mathbf{X}_{m}}\gamma_{\mathbf{X}_{m}}$. Maximal information gain is closely related to the effective dimension of a kernel (Calandriello et al., 2019) and helps characterize the regret performance of kernel bandit algorithms (Srinivas et al., 2010; Chowdhury and Gopalan, 2017). The quantity $\gamma_{m}$ depends only on the kernel and $\lambda$ and has been shown to be an increasing sublinear function of $m$ (Srinivas et al., 2010; Vakili et al., 2021b).
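The information gain of a given point set can be evaluated numerically from its kernel matrix. The sketch below assumes a squared-exponential kernel on a one-dimensional grid; the helper names and parameter values are illustrative. Note that it computes $\gamma_{\mathbf{X}_{m}}$ for one particular set, not the supremum $\gamma_{m}$.

```python
import numpy as np


def se_kernel_matrix(x, lengthscale=0.2):
    """Squared-exponential kernel matrix for a 1-D array of points."""
    diff = x[:, None] - x[None, :]
    return np.exp(-0.5 * diff**2 / lengthscale**2)


def information_gain(K, lam):
    """gamma_X = 0.5 * log det(I_m + K / lam) for a kernel matrix K."""
    m = K.shape[0]
    # slogdet is used instead of log(det(...)) to avoid overflow or underflow.
    _, logdet = np.linalg.slogdet(np.eye(m) + K / lam)
    return 0.5 * logdet


x = np.linspace(0.0, 1.0, 40)
lam = 0.1
gain_first_10 = information_gain(se_kernel_matrix(x[:10]), lam)
gain_all_40 = information_gain(se_kernel_matrix(x), lam)
```

Because the second set contains the first, its information gain is at least as large, in line with $\gamma_{m}$ being increasing in $m$.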

2.2 Sparse approximation of GP models

The sparsification of GP models refers to the idea of approximating the posterior mean and variance of a GP model, corresponding to a set of observations $\{\mathbf{X}_{m},\mathbf{Y}_{m}\}$, using a subset of the query points $\mathbf{X}_{m}$. In particular, let $\mathcal{S}$ be a subset of $\mathbf{X}_{m}$ consisting of $r<m$ points. The approximate posterior mean and variance based on the points in $\mathcal{S}$, referred to as the inducing set, are given as follows (Wild et al., 2021):

$$\tilde{\mu}_{m}(x)=z_{\mathcal{S}}(x)^{\top}\left(\lambda\mathbf{I}_{|\mathcal{S}|}+\mathbf{Z}_{\mathbf{X}_{m},\mathcal{S}}^{\top}\mathbf{Z}_{\mathbf{X}_{m},\mathcal{S}}\right)^{-1}\mathbf{Z}^{\top}_{\mathbf{X}_{m},\mathcal{S}}\mathbf{Y}_{m}, \tag{5}$$

$$\lambda\tilde{\sigma}^{2}_{m}(x)=k(x,x)-z_{\mathcal{S}}(x)^{\top}\mathbf{Z}^{\top}_{\mathbf{X}_{m},\mathcal{S}}\mathbf{Z}_{\mathbf{X}_{m},\mathcal{S}}\left(\lambda\mathbf{I}_{|\mathcal{S}|}+\mathbf{Z}_{\mathbf{X}_{m},\mathcal{S}}^{\top}\mathbf{Z}_{\mathbf{X}_{m},\mathcal{S}}\right)^{-1}z_{\mathcal{S}}(x), \tag{6}$$

where $z_{\mathcal{S}}(x)=\mathbf{K}_{\mathcal{S},\mathcal{S}}^{-\frac{1}{2}}k_{\mathcal{S}}(x)$ and $\mathbf{Z}_{\mathbf{X}_{m},\mathcal{S}}=[z_{\mathcal{S}}(x_{1}),z_{\mathcal{S}}(x_{2}),\dots,z_{\mathcal{S}}(x_{m})]^{\top}$.

Note that it is sufficient to know the matrix $\mathbf{Z}_{\mathbf{X}_{m},\mathcal{S}}^{\top}\mathbf{Z}_{\mathbf{X}_{m},\mathcal{S}}\in\mathbb{R}^{r\times r}$, the vector $\mathbf{Z}^{\top}_{\mathbf{X}_{m},\mathcal{S}}\mathbf{Y}_{m}\in\mathbb{R}^{r}$ and the set $\mathcal{S}$ in order to calculate $\tilde{\mu}_{m}$ and $\tilde{\sigma}_{m}$.
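A minimal numerical sketch of (5) and (6) is given below. It assumes a squared-exponential kernel (so that $k(x,x)=1$), a small diagonal jitter when forming $\mathbf{K}_{\mathcal{S},\mathcal{S}}^{-1/2}$, and synthetic one-dimensional data; the helper names `features` and `sparse_posterior` are hypothetical. The second return value is the right-hand side of (6), i.e., $\lambda\tilde{\sigma}^{2}_{m}(x)$.

```python
import numpy as np


def se_kernel(a, b, lengthscale=0.1):
    """Squared-exponential kernel matrix between 1-D arrays a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)


def features(S, x):
    """Matrix whose i-th row is z_S(x_i)^T = k_S(x_i)^T K_SS^{-1/2}."""
    K_SS = se_kernel(S, S) + 1e-10 * np.eye(len(S))   # jitter is an assumption
    w, V = np.linalg.eigh(K_SS)
    inv_sqrt = (V / np.sqrt(w)) @ V.T                 # symmetric K_SS^{-1/2}
    return se_kernel(x, S) @ inv_sqrt


def sparse_posterior(X, Y, S, x_test, lam=0.1):
    """Approximate mean (5) and lam * variance (6) from the inducing set S."""
    Z = features(S, X)                      # Z_{X_m,S}, shape (m, r)
    z = features(S, x_test)                 # rows are z_S(x)^T
    G = Z.T @ Z                             # r x r matrix
    A = lam * np.eye(len(S)) + G
    mean = z @ np.linalg.solve(A, Z.T @ Y)
    quad = np.einsum("ij,ij->i", z @ G, np.linalg.solve(A, z.T).T)
    return mean, 1.0 - quad                 # k(x, x) = 1 for this kernel


rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 12)
Y = np.sin(6.0 * X) + 0.1 * rng.standard_normal(12)
S = X[::3]                                  # inducing set with r = 4 points
x_test = np.linspace(0.0, 1.0, 25)
lam = 0.1
mean_tilde, lam_var_tilde = sparse_posterior(X, Y, S, x_test, lam)
```

Only the $r\times r$ matrix `G`, the $r$-vector `Z.T @ Y` and the set `S` enter the final computation, which is what makes the approximation attractive for communication.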

3 The DUETS Algorithm

In this section, we present the proposed algorithm DUETS.

We first describe the randomization at each agent and the shared randomness with the server. Each agent $n$ has a private coin $\mathscr{C}_{n}$ for generating random bits that are independent of those generated by other agents. Each agent's coin is hidden from the other agents but known to the central server. As a result, the server can reproduce the random bits generated at all agents.
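One simple way to realize such shared randomness in practice is to give each agent a pseudo-random generator whose seed is also known to the server. The sketch below illustrates the idea on the interval $[0,1]$; the seed values and the helper `local_queries` are hypothetical, and the source does not prescribe a particular mechanism.

```python
import random


def local_queries(seed, num_points, low=0.0, high=1.0):
    """Draw num_points uniform points on [low, high] from a seeded generator."""
    coin = random.Random(seed)   # plays the role of the agent's coin
    return [coin.uniform(low, high) for _ in range(num_points)]


agent_seeds = [11, 22, 33]       # hypothetical seeds, one per agent

# Each agent draws its own query points from its own coin.
agent_side = [local_queries(seed, 4) for seed in agent_seeds]

# The server, knowing the seeds, regenerates the same points with no messages.
server_side = [local_queries(seed, 4) for seed in agent_seeds]
```

In a multi-epoch run the server would also have to advance each generator in lockstep with the corresponding agent, so that the reproduced points stay aligned across epochs.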

DUETS employs an epoch-based elimination structure where the domain $\mathcal{X}$ is successively trimmed across epochs to maintain an active region that contains a global maximizer $x^{*}$ with high probability for future exploration. Specifically, in each epoch $j$ , the server and the agents maintain a common active subset of the domain $\mathcal{X}_{j}\subseteq\mathcal{X}$ with $\mathcal{X}_{1}$ initialized to $\mathcal{X}$ . The operations in each epoch are as follows.

During the $j^{\text{th}}$ epoch, each agent $n$, using its private coin $\mathscr{C}_{n}$, generates $\mathcal{D}_{j}^{(n)}$, a set of $T_{j}$ points that are uniformly distributed in the set $\mathcal{X}_{j}$. (If the active region consists of multiple disjoint regions, then we carry out this step for each region separately. For simplicity of description, we assume the active region consists of a single connected component.) The epoch length $T_{j}$ is set to $\lfloor\sqrt{TT_{j-1}}\rfloor$, with $T_{1}$ being an input to the algorithm. Each agent $n$ queries all the points in $\mathcal{D}_{j}^{(n)}$ and obtains $\mathbf{Y}_{j}^{(n)}\in\mathbb{R}^{T_{j}}$, the corresponding vector of reward observations.
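The epoch lengths grow very quickly under the rule $T_{j}=\lfloor\sqrt{TT_{j-1}}\rfloor$, so only a handful of epochs fit in the horizon. The following sketch computes the schedule; truncating the last epoch so that the total equals $T$ is a simplification made for this example, and the values $T=10{,}000$ and $T_{1}=10$ are arbitrary.

```python
import math


def epoch_lengths(T, T1):
    """Epoch sizes with T_j = floor(sqrt(T * T_{j-1})), last epoch cut at the horizon T."""
    lengths, used, T_j = [], 0, T1
    while used < T:
        current = min(T_j, T - used)   # do not run past the horizon
        lengths.append(current)
        used += current
        T_j = math.isqrt(T * T_j)      # exact integer floor of the square root
    return lengths


schedule = epoch_lengths(T=10_000, T1=10)
```

For these values the horizon is covered in five epochs, illustrating why the number of communication rounds stays small even for long horizons.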

Since the server has access to the coins of all the agents, it can faithfully reproduce the set $\mathcal{D}_{j}=\bigcup_{n=1}^{N}\mathcal{D}_{j}^{(n)}$ without any communication between the server and the agents. In order to efficiently communicate the observed reward values from the agents to the server, we leverage sparse approximation of GP models along with the knowledge of the set $\mathcal{D}_{j}$ at the server. The server constructs a global inducing set $\mathcal{S}_{j}$ by including each point in $\mathcal{D}_{j}$ with probability $p_{j}:=p_{0}\sigma_{j,\max}^{2}$, independently of the other points, where $\sigma_{j,\max}^{2}=\sup_{x\in\mathcal{X}_{j}}\sigma_{j}^{2}(x)$ and $\sigma_{j}^{2}(\cdot)$ is the posterior variance corresponding to the points collected in $\mathcal{D}_{j}$. Here, $p_{0}=72\log\left(\frac{4NT}{\delta^{\prime}}\right)$ is an appropriately chosen universal constant which ensures that the approximate posterior statistics constructed using $\mathcal{S}_{j}$ are a faithful representation of the true posterior statistics corresponding to the set $\mathcal{D}_{j}$ with probability $1-\delta$. The server broadcasts the inducing set $\mathcal{S}_{j}$ to all the agents.
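The construction of the inducing set amounts to one independent biased coin flip per point of $\mathcal{D}_{j}$. The sketch below takes $\sigma_{j,\max}^{2}$ as a given input rather than computing it, clips $p_{j}$ at 1 (an assumption made here so that it is a valid probability), and uses arbitrary illustrative values for $N$, $T$ and $\delta^{\prime}$.

```python
import numpy as np


def sample_inducing_set(D, sigma2_max, N, T, delta_prime, rng):
    """Keep each point of D independently with probability p_j = p_0 * sigma2_max."""
    p0 = 72.0 * np.log(4.0 * N * T / delta_prime)
    p_j = min(1.0, p0 * sigma2_max)      # clipping at 1 is an assumption of this sketch
    keep = rng.random(len(D)) < p_j      # one independent Bernoulli(p_j) draw per point
    return D[keep], p_j


rng = np.random.default_rng(0)
D_j = rng.uniform(0.0, 1.0, size=2000)   # stand-in for the pooled query points
S_j, p_j = sample_inducing_set(
    D_j, sigma2_max=1e-4, N=10, T=200, delta_prime=0.01, rng=rng
)
```

As the posterior variance shrinks with more data, $p_{j}$ decreases, so the inducing set becomes a diminishing fraction of the points in $\mathcal{D}_{j}$.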

Upon receiving the inducing set, each agent $n$ computes the projection $v_{j}^{(n)}\in\mathbb{R}^{|\mathcal{S}_{j}|}$ of its reward vector onto the inducing set as follows:

$$v_{j}^{(n)}:=\mathbf{Z}_{\mathcal{D}^{(n)}_{j},\mathcal{S}_{j}}^{\top}\mathbf{Y}^{(n)}_{j}. \tag{7}$$

Each agent then sends back the lower-dimensional projected observations $v_{j}^{(n)}$ to the server, which subsequently aggregates them to obtain the vector $\overline{v}_{j}$ given as

$$\overline{v}_{j}:=\left(\lambda\mathbf{I}_{|\mathcal{S}_{j}|}+\mathbf{Z}_{\mathcal{D}_{j},\mathcal{S}_{j}}^{\top}\mathbf{Z}_{\mathcal{D}_{j},\mathcal{S}_{j}}\right)^{-1}\left(\sum_{n=1}^{N}v_{j}^{(n)}\right). \tag{8}$$

Note that the summation $\sum_{n=1}^{N}v_{j}^{(n)}$ equals $\mathbf{Z}_{\mathcal{D}_{j},\mathcal{S}_{j}}^{\top}\mathbf{Y}_{j}$, i.e., the projection of the rewards of all agents onto the inducing set. The server then broadcasts the vector $\overline{v}_{j}$ and $\sigma_{j,\max}$ to all the agents. The benefit of sending $\overline{v}_{j}$ as opposed to the sum of rewards is that it allows the agents to compute the posterior mean locally using their knowledge of the inducing set $\mathcal{S}_{j}$ (see Eqn. (5)).
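The aggregation step relies on the fact that stacking the agents' feature matrices row-wise turns the sum of local projections into the global projection. The sketch below checks this identity numerically with random stand-ins for the matrices $\mathbf{Z}_{\mathcal{D}^{(n)}_{j},\mathcal{S}_{j}}$ and the reward vectors; all sizes and values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T_j, r = 4, 50, 6    # agents, points per agent, size of the inducing set
lam = 0.1

# Random stand-ins for each agent's feature matrix and reward vector.
Z_local = [rng.standard_normal((T_j, r)) for _ in range(N)]
Y_local = [rng.standard_normal(T_j) for _ in range(N)]

# Eq. (7): each agent uploads only r real numbers instead of T_j rewards.
v_local = [Z.T @ Y for Z, Y in zip(Z_local, Y_local)]

# The server stacks the (reproducible) feature matrices and applies Eq. (8).
Z_all = np.vstack(Z_local)
Y_all = np.concatenate(Y_local)   # never transmitted; used here only as a check
v_sum = sum(v_local)
v_bar = np.linalg.solve(lam * np.eye(r) + Z_all.T @ Z_all, v_sum)
```

In this toy instance each agent uploads $r=6$ numbers per epoch rather than $T_{j}=50$ raw rewards, and the server can form `Z_all` without any communication because it reproduces the query points itself.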

As the last step of the epoch, all the agents and the server trim the current set $\mathcal{X}_{j}$ to $\mathcal{X}_{j+1}$ using the following update rule:

$$\mathcal{X}_{j+1}=\left\{x\in\mathcal{X}_{j}:\tilde{\mu}_{j}(x)\geq\sup_{x^{\prime}\in\mathcal{X}_{j}}\tilde{\mu}_{j}(x^{\prime})-2\beta(\delta^{\prime})\sigma_{j,\max}\right\}, \tag{9}$$

where $\delta^{\prime}=\frac{\delta}{2|\mathcal{U}_{T}|\cdot(\log(\log{N}\log{T})+4)}$ and $\tilde{\mu}_{j}(x)=z^{\top}_{\mathcal{S}_{j}}(x)\overline{v}_{j}$ is the approximate posterior mean computed based on the inducing set $\mathcal{S}_{j}$ (see Eqn. (5)). Since the posterior mean provides an estimate for the function values, the update condition is designed to eliminate all points at which the (estimated) function value is smaller than the current best estimate of the maximum value, up to an estimation error. Note that trimming is a deterministic procedure, which ensures that all the agents and the server share a common value of $\mathcal{X}_{j+1}$.
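The trimming rule (9) can be illustrated on a finite grid of candidate points, where the supremum becomes a maximum. Replacing the continuous region $\mathcal{X}_{j}$ by a grid, as well as the quadratic stand-in for $\tilde{\mu}_{j}$ and the values of $\beta$ and $\sigma_{j,\max}$, are assumptions made purely for this sketch.

```python
import numpy as np


def trim(candidates, mu_tilde, beta, sigma_max):
    """Keep candidates whose estimated value is within 2 * beta * sigma_max of the best."""
    threshold = mu_tilde.max() - 2.0 * beta * sigma_max
    return candidates[mu_tilde >= threshold]


candidates = np.linspace(0.0, 1.0, 101)     # grid stand-in for the active region
mu_tilde = -(candidates - 0.3) ** 2         # toy posterior mean peaking near x = 0.3
kept = trim(candidates, mu_tilde, beta=2.0, sigma_max=0.01)
```

Here the threshold is $0.04$ below the best estimate, so the surviving region is roughly the interval $[0.1,0.5]$ around the estimated maximizer. Since the rule uses no fresh randomness, every party applying it to the same $\tilde{\mu}_{j}$ obtains the same set.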

Detailed pseudocode for the agent side and the server side of DUETS is provided in Algorithms 1 and 2, respectively.

Algorithm 1 DUETS: Agent $n\in\{1,2,\dots,N\}$

1: Input: Size of the first epoch $T_{1}$ , error probability $\delta$
2: $t\leftarrow 0,j\leftarrow 1$ , $\mathcal{X}_{1}\leftarrow\mathcal{X}$
3: while $t<T$
4:   $\mathcal{D}_{j}^{(n)}=\emptyset$
5:   for $i\in\{1,2,\dots,T_{j}\}$
6:     Query a point $x_{t}^{(n)}$ uniformly at random from $\mathcal{X}_{j}$ using the coin $\mathscr{C}_{n}$ and observe $y_{t}^{(n)}$
7:     $\mathcal{D}_{j}^{(n)}\leftarrow\mathcal{D}_{j}^{(n)}\cup\{x_{t}^{(n)}\}$
8:     $t\leftarrow t+1$
9:     if $t>T$ then
10:      Terminate
11:    end if
12:  end for
13:  Receive the global inducing set $\mathcal{S}_{j}$
14:  Set $v_{j}^{(n)}\leftarrow\mathbf{Z}_{\mathcal{D}^{(n)}_{j},\mathcal{S}_{j}}^{\top}\mathbf{Y}^{(n)}_{j}$ , where $\mathbf{Y}^{(n)}_{j}=[y_{t-T_{j}},y_{t-T_{j}+1},\dots,y_{t}]^{\top}$
15:  Receive $\overline{v}_{j}$ and $\sigma_{j,\mathrm{max}}$ from the server
16:  Use $\overline{v}_{j}$ to compute $\tilde{\mu}_{j}(\cdot)=z^{\top}_{\mathcal{S}_{j}}(\cdot)\overline{v}_{j}$
17:  Update $\mathcal{X}_{j}$ to $\mathcal{X}_{j+1}$ using Eqn. (9)
18:  $T_{j+1}\leftarrow\lfloor\sqrt{TT_{j}}\rfloor$
19:  $j\leftarrow j+1$
20: end while
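The role of the coin $\mathscr{C}_{n}$ in Algorithm 1 can be illustrated with a seeded pseudo-random generator. The sketch below is only one possible reading, under explicit assumptions: the shared coin is modeled as an integer seed known to both the agent and the server, the domain is replaced by a finite candidate list, and the function name and the seed-mixing constant are hypothetical.

```python
import random


def sample_epoch_queries(coin_seed, epoch, candidates, num_points):
    """Draw num_points uniform samples from candidates with a generator that
    is fully determined by the shared seed and the epoch index."""
    # Mixing constant is arbitrary; it only separates the streams of epochs.
    rng = random.Random(coin_seed * 1_000_003 + epoch)
    return [rng.choice(candidates) for _ in range(num_points)]


candidates = [i / 100 for i in range(101)]  # toy finite stand-in for the domain
coin_seed = 12345                           # hypothetical shared coin of one agent

# The agent draws its query points; the server replays the same draws.
agent_queries = sample_epoch_queries(coin_seed, 1, candidates, 5)
server_copy = sample_epoch_queries(coin_seed, 1, candidates, 5)
```

Since the server can replay the draws, the agent never has to transmit its query points, which is what keeps the uplink limited to the vector $v_{j}^{(n)}$ .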

Algorithm 2 DUETS: Server

1: Input: Size of the first epoch $T_{1}$ , error probability $\delta$
2: $t\leftarrow 0,j\leftarrow 1$ , $\mathcal{X}_{1}\leftarrow\mathcal{X}$
3: while $t<T$
4:   Use the coins $\mathscr{C}_{1},\mathscr{C}_{2},\dots,\mathscr{C}_{N}$ to reproduce the sets $\mathcal{D}_{j}^{(1)},\mathcal{D}_{j}^{(2)},\dots,\mathcal{D}_{j}^{(N)}$
5:   $\mathcal{D}_{j}\leftarrow\bigcup_{n=1}^{N}\mathcal{D}_{j}^{(n)}$
6:   Set $\sigma_{j,\max}\leftarrow\sup_{x\in\mathcal{X}_{j}}\sigma_{j}(x)$
7:   Construct the set $\mathcal{S}_{j}$ by including each point from $\mathcal{D}_{j}$ with probability $p_{j}$ , independent of other points
8:   Broadcast $\mathcal{S}_{j}$ to all the agents
9:   Receive $v_{j}^{(n)}$ from all agents $n\in\{1,2,\dots,N\}$
10:  Set $\overline{v}_{j}\leftarrow\left(\lambda\mathbf{I}_{|\mathcal{S}_{j}|}+\mathbf{Z}_{\mathcal{D}_{j},\mathcal{S}_{j}}^{\top}\mathbf{Z}_{\mathcal{D}_{j},\mathcal{S}_{j}}\right)^{-1}\left(\sum_{n=1}^{N}v_{j}^{(n)}\right)$
11:  Broadcast $\overline{v}_{j}$ and $\sigma_{j,\max}$ to all the agents
12:  Update $\mathcal{X}_{j}$ to $\mathcal{X}_{j+1}$ using Eqn. (9)
13:  $t\leftarrow t+T_{j}$
14:  $T_{j+1}\leftarrow\lfloor\sqrt{TT_{j}}\rfloor$
15:  $j\leftarrow j+1$
16: end while
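The epoch schedule $T_{j+1}\leftarrow\lfloor\sqrt{TT_{j}}\rfloor$ used in both algorithms grows quickly, so only a handful of epochs fit in the horizon. The following sketch simply iterates the recursion; the function name is hypothetical, the last epoch is listed at its nominal length even though the algorithm terminates it at time $T$ , and the values $T=50$ , $T_{1}=2$ are taken from the experimental setup only as a convenient example.

```python
import math


def epoch_lengths(T, T1):
    """Nominal epoch lengths T_1, T_2, ... until the horizon T is covered."""
    if T1 < 1:
        raise ValueError("T1 must be at least 1")
    lengths, t, T_j = [], 0, T1
    while t < T:
        lengths.append(T_j)
        t += T_j
        # math.isqrt returns the floor of the square root exactly.
        T_j = math.isqrt(T * T_j)
    return lengths


schedule = epoch_lengths(T=50, T1=2)
```

Each consecutive pair in the schedule satisfies $T_{j}^{2}\leq T\cdot T_{j-1}$ , which is the inequality $T_{j}/\sqrt{T_{j-1}}\leq\sqrt{T}$ invoked later in the regret analysis.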

4 Performance Analysis

The following theorem characterizes the regret performance and communication cost of DUETS.

Theorem 4.1.

Consider the distributed kernel bandit problem described in Section 2. For a given $\delta\in(0,1)$ , let the policy parameters of DUETS be such that $T_{1}\geq\overline{M}/N$ and $p_{0}=72\log{\frac{4N}{\delta}}$ . Then with probability at least $1-\delta$ , the regret and communication cost incurred by DUETS satisfy the following relations:

	$\displaystyle R_{\mathrm{DUETS}}$	$\displaystyle=\tilde{\mathcal{O}}(\sqrt{NT\gamma_{NT}}\log(T/\delta))$
	$\displaystyle C_{\mathrm{DUETS}}$	$\displaystyle=\tilde{\mathcal{O}}(\gamma_{NT}).$

Here, $\overline{M}$ is a constant that depends only upon the kernel $k$ and the domain $\mathcal{X}$ and is independent of $N$ and $T$ .⁴

⁴ The constant $\overline{M}$ is the same as the one in Lemma 4.3, which has been adopted from Salgia et al. (2023a). We refer the reader to Salgia et al. (2023a) for an exact expression of the constant and additional related discussion.

As shown in the above theorem, DUETS achieves order-optimal regret as it matches the lower bound established in Scarlett et al. (2017) up to logarithmic factors. DUETS is the first algorithm to close this gap to the lower bound in the distributed setup and achieve order-optimal regret performance. Moreover, DUETS incurs a communication cost that is sublinear in both $T$ and $N$ for all kernels. Furthermore, it can be much smaller than $NT$ , depending upon the smoothness of the kernel. For example, using the bounds on information gain (Vakili et al., 2021b), we can show that the communication cost incurred by DUETS is $\mathcal{O}(\log^{d}(NT))$ .

Proof.

We provide a sketch of the proof of Theorem 4.1 here. The regret bound is obtained by first bounding the regret incurred by DUETS in each epoch $j$ and then summing the regret across different epochs. In any epoch $j$ , the agents take purely exploratory actions by uniformly sampling the region $\mathcal{X}_{j}$ . Thus, to bound the regret incurred at any step during an epoch, we use the crude bound $\Delta_{j}:=\sup_{x\in\mathcal{X}_{j}}(f(x^{*})-f(x))$ . Consequently, the regret during the $j^{\text{th}}$ epoch, denoted by $R^{(j)}$ , is upper bounded by $N\cdot\Delta_{j}T_{j}$ . Note that the update criterion (Eqn. (9)) is designed to obtain a refined localization of $x^{*}$ by eliminating the points with low function values, consequently leading to smaller values of $\Delta_{j}$ as the algorithm proceeds. The epoch lengths are carefully chosen to balance the increase in epoch length with the decrease in $\Delta_{j}$ to obtain the tightest bound. These ideas are captured in the following lemmas, from which the regret bound follows.

Lemma 4.2.

Let $\Delta_{j}:=\sup_{x\in\mathcal{X}_{j}}(f(x^{*})-f(x))$ . Then, the following bound holds for all epochs $j\geq 1$ with probability at least $1-\frac{\delta}{2}$ :

\Delta_{j}\leq 8\beta(\delta^{\prime})\cdot\left(\sup_{x\in\mathcal{X}_{j-1}}% \sigma_{j}(x)\right)+\frac{4B}{T},

where $\delta^{\prime}=\frac{\delta}{2(\log(\log{N}+\log{T})+4)|\mathcal{U}_{T}|}$ and $\mathcal{U}_{T}$ denotes the discretization defined in Assumption 2.3.

Lemma 4.3.

Let $\sigma_{j}^{2}(\cdot)$ denote the posterior variance corresponding to the set $\mathcal{D}_{j}$ obtained by sampling $NT_{j}$ points uniformly at random from the domain $\mathcal{X}_{j}$ . Then, for $T_{1}\geq\overline{M}(\delta)/N$ and for any $f$ satisfying Assumption 2.4, the following bound holds with probability $1-\delta$ for all epochs $j\geq 1$ :

\displaystyle\sup_{x\in\mathcal{X}_{j}}\sigma_{j}^{2}(x)\leq C_{f,\mathcal{X}}% \cdot\frac{\gamma_{NT_{j}}}{NT_{j}}.

Here $C_{f,\mathcal{X}}$ denotes a constant that depends only on $f$ and the domain $\mathcal{X}$ and is independent of both $N$ and $T$ .

Lemma 4.4.

The total number of epochs in DUETS over a time horizon of $T$ is less than $\log(\log(\max\{N,T\}))+4$ .

Lemma 4.3 is a result adopted from the recent work by Salgia et al. (2023a) that establishes bounds on the worst-case posterior variance corresponding to a set of randomly sampled points.

For the bound on communication cost, note that in each epoch $j$ , the server broadcasts the inducing set $\mathcal{S}_{j}$ , which consists of $|\mathcal{S}_{j}|$ vectors in $\mathbb{R}^{d}$ , the vector $\overline{v}_{j}\in\mathbb{R}^{|\mathcal{S}_{j}|}$ and the scalar $\sigma_{j,\max}$ , resulting in a downlink cost of $\mathcal{O}(|\mathcal{S}_{j}|)$ in epoch $j$ . Similarly, since each agent uploads only ${v}_{j}^{(n)}\in\mathbb{R}^{|\mathcal{S}_{j}|}$ , the uplink cost of DUETS in epoch $j$ is also $\mathcal{O}(|\mathcal{S}_{j}|)$ . Consequently, the communication cost of DUETS in epoch $j$ is $\mathcal{O}(|\mathcal{S}_{j}|)$ . The following lemma gives a high-probability bound on $|\mathcal{S}_{j}|$ .

Lemma 4.5.

Let $\mathcal{S}_{j}$ denote the inducing set constructed in the $j^{\text{th}}$ epoch, as outlined in Section 3. Then, with probability at least $1-\delta$ ,

\displaystyle|\mathcal{S}_{j}|\leq C_{f,\mathcal{X}}\cdot\left(3+\log\left(% \frac{\log(\log{N}\log{T})}{\delta}\right)\right)\cdot\gamma_{NT},

holds for all epochs $j$ . In the above expression, $C_{f,\mathcal{X}}$ is the same as the constant in Lemma 4.3.

The bound on the communication cost follows directly from Lemmas 4.5 and 4.4. Please refer to Appendix A for a detailed proof.

∎

5 Empirical Studies

We perform several empirical studies to corroborate our theoretical findings. We compare the regret performance and communication cost of our proposed algorithm, DUETS, against three baseline algorithms — DisKernelUCB, ApproxDisKernelUCB and N-KernelUCB. The first two are distributed kernel bandits algorithms proposed in Li et al. (2022). N-KernelUCB is a baseline algorithm considered in Li et al. (2022) where each agent locally runs the GP-UCB algorithm (Chowdhury and Gopalan, 2017) with no communication among the agents.

We compare the performance of all four algorithms across four benchmark functions. The first two are synthetic functions $h_{1},h_{2}:\mathcal{B}\to\mathbb{R}$ considered in Li et al. (2022), where $\mathcal{B}$ denotes the unit ball centered at the origin in $\mathbb{R}^{10}$ . The functions are given by:

	$\displaystyle h_{1}(x)$	$\displaystyle:=\cos(3x^{\top}\theta^{\star})$
	$\displaystyle h_{2}(x)$	$\displaystyle:=(x^{\top}\theta^{\star})^{3}-3(x^{\top}\theta^{\star})^{2}+3(x^% {\top}\theta^{\star})+3.$

For both functions, $\theta^{\star}$ is randomly chosen from the surface of the unit ball $\mathcal{B}$ . The other two functions are Branin (Azimi et al., 2012; Picheny et al., 2013) and Hartmann-4D (Picheny et al., 2013), which are commonly used benchmark functions for Bayesian optimization. The Branin function is defined over $\mathcal{X}=[0,1]^{2}$ while the Hartmann-4D function is defined over $\mathcal{X}=[0,1]^{4}$ .
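For concreteness, the two synthetic benchmarks $h_{1}$ and $h_{2}$ can be written down directly from their definitions. The sketch below is illustrative only: the text does not specify how $\theta^{\star}$ is drawn beyond being random on the unit sphere, so normalizing a standard Gaussian vector (which yields a uniform direction) and the seed value are assumptions, as are the helper names.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Uniform direction on the unit sphere via a normalized Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


def dot(x, theta):
    return sum(a * b for a, b in zip(x, theta))


def h1(x, theta):
    """h_1(x) = cos(3 x^T theta)."""
    return math.cos(3.0 * dot(x, theta))


def h2(x, theta):
    """h_2(x) = u^3 - 3u^2 + 3u + 3 with u = x^T theta."""
    u = dot(x, theta)
    return u ** 3 - 3.0 * u ** 2 + 3.0 * u + 3.0


rng = random.Random(0)                 # illustrative seed
theta_star = random_unit_vector(10, rng)
origin = [0.0] * 10
```

Both functions depend on $x$ only through the scalar $x^{\top}\theta^{\star}\in[-1,1]$ , and $h_{2}$ can equivalently be written as $(x^{\top}\theta^{\star}-1)^{3}+4$ , so over the unit ball it is maximized at $x=\theta^{\star}$ .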

We consider the distributed kernel bandit problem described in Section 2 with $N=10$ agents. For all the experiments, we use the Squared Exponential kernel. The length scale was set to $0.2$ for Branin and to $1$ for all other functions. The observations were corrupted with zero-mean Gaussian noise with a standard deviation of $0.2$ . The parameter $D$ for ApproxDisKernelUCB and DisKernelUCB was set to $20$ and $10$ , respectively. For DUETS, we set $T_{1}=2$ and $p_{0}=10$ . The parameter $\beta$ was selected using a grid search over $\{0.2,0.5,1,2,5\}$ for all the algorithms. All the algorithms were run for $T=50$ time steps. We averaged the cumulative regret and the communication cost incurred by different algorithms over $5$ Monte Carlo runs.

The cumulative regret incurred by the different algorithms across the benchmark functions is shown in the top row of Figure 1. The bottom row consists of the corresponding plots for the communication cost incurred by the different algorithms. The shaded regions denote error bars of up to one standard deviation on either side of the mean value. As evident from the plots, DUETS achieves a significantly lower regret than all other algorithms consistently across benchmark functions. DUETS also incurs a smaller communication overhead than the other algorithms, corroborating our theoretical results.

References

Abbasi-Yadkori et al. (2011) Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Proceedings of the 25th Annual Conference on Neural Information Processing Systems, 2011. ISBN 9781618395993.
Amani et al. (2022) S. Amani, T. Lattimore, A. György, and L. F. Yang. Distributed Contextual Linear Bandits with Minimax Optimal Communication Cost, 2022. URL http://arxiv.org/abs/2205.13170.
Azimi et al. (2012) J. Azimi, A. Jalali, and X. Z. Fern. Hybrid batch bayesian optimization. In Proceedings of the 29th International Conference on Machine Learning, ICML, volume 2, pages 1215–1222, 2012. ISBN 9781450312851.
Calandriello et al. (2019) D. Calandriello, L. Carratino, A. Lazaric, M. Valko, and L. Rosasco. Gaussian Process Optimization with Adaptive Sketching: Scalable and No Regret. Proceedings of Machine Learning Research, 99:1–25, 2019.
Camilleri et al. (2021) R. Camilleri, J. Katz-Samuels, and K. Jamieson. High-Dimensional Experimental Design and Kernel Bandits. In Proceedings of the 38th International Conference on Machine Learning, 2021.
Chawla et al. (2020) R. Chawla, A. Sankararaman, A. Ganesh, and S. Shakkottai. The Gossiping Insert-Eliminate Algorithm for Multi-Agent Bandits, 2020.
Chowdhury and Gopalan (2017) S. R. Chowdhury and A. Gopalan. On kernelized multi-armed bandits. In Proceedings of the 34th International Conference on Machine Learning, ICML, volume 2, pages 1397–1422, 2017.
Dai et al. (2020) Z. Dai, B. K. H. Low, and P. Jaillet. Federated Bayesian optimization via Thompson sampling. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems, volume 2020-Decem, 2020.
Du et al. (2023) Y. Du, W. Chen, Y. Kuroki, and L. Huang. Collaborative Pure Exploration in Kernel Bandit. In Proceedings of the 11th International Conference on Learning Representations, ICLR, 2023.
Dubey and Pentland (2020) A. Dubey and A. Pentland. Kernel methods for cooperative multi-agent contextual bandits. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, pages 2720–2730, 2020. ISBN 9781713821120.
Ghosh et al. (2021) A. Ghosh, A. Sankararaman, and K. Ramchandran. Adaptive Clustering and Personalization in Multi-Agent Stochastic Linear Bandits, 2021. URL http://arxiv.org/abs/2106.08902.
Hanna et al. (2022) O. Hanna, L. Yang, and C. Fragouli. Learning from distributed users in contextual linear bandits without sharing the context. In Proceedings of the 36th Annual Conference on Neural Information Processing Systems, volume 35, pages 11049–11062, 2022.
Huang et al. (2021) R. Huang, W. Wu, J. Yang, and C. Shen. Federated Linear Contextual Bandits. In Advances in Neural Information Processing Systems, volume 32, pages 27057–27068, 2021. ISBN 9781713845393.
Jacot et al. (2018) A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Proceedings of the 32nd Annual Conference on Neural Information Processing Systems, pages 8571–8580, 2018.
Korda et al. (2016) N. Korda, B. Szorenyi, and S. Li. Distributed clustering of linear bandits in peer to peer networks. In 33rd International Conference on Machine Learning, ICML 2016, volume 3, pages 1966–1980, 2016. ISBN 9781510829008.
Landgren et al. (2017) P. Landgren, V. Srivastava, and N. E. Leonard. On distributed cooperative decision-making in multiarmed bandits. In Proceedings of the European Control Conference, ECC, pages 243–248, 2017. ISBN 9781509025916.
Li et al. (2022) C. Li, H. Wang, M. Wang, and H. Wang. Communication Efficient Distributed Learning for Kernelized Contextual Bandits. In Proceedings of the 36th Annual Conference on Neural Information Processing Systems, 2022.
Li and Scarlett (2022) Z. Li and J. Scarlett. Gaussian process bandit optimization with few batches. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, AISTATS, 2022.
Mitra et al. (2021) A. Mitra, H. Hassani, and G. Pappas. Exploiting Heterogeneity in Robust Federated Best-Arm Identification, 2021. URL http://arxiv.org/abs/2109.05700.
Mitra et al. (2022) A. Mitra, H. Hassani, and G. J. Pappas. Linear Stochastic Bandits over a Bit-Constrained Channel, 2022.
Picheny et al. (2013) V. Picheny, T. Wagner, and D. Ginsbourger. A benchmark of kriging-based infill criteria for noisy optimization. Structural and Multidisciplinary Optimization, 48(3):607–626, 2013. ISSN 1615147X.
Salgia and Zhao (2023) S. Salgia and Q. Zhao. Distributed linear bandits under communication constraints. In Proceedings of the 40th International Conference on Machine Learning, ICML, pages 29845–29875. PMLR, 2023.
Salgia et al. (2021) S. Salgia, S. Vakili, and Q. Zhao. A domain-shrinking based Bayesian optimization algorithm with order-optimal regret performance. In Proceedings of the 35th Annual Conference on Neural Information Processing Systems, volume 34, 2021.
Salgia et al. (2023a) S. Salgia, S. Vakili, and Q. Zhao. Random exploration in bayesian optimization: Order-optimal regret and computational efficiency, 2023a.
Salgia et al. (2023b) S. Salgia, S. Vakili, and Q. Zhao. Collaborative learning in kernel-based bandits for distributed users. IEEE Transactions on Signal Processing, 71:3956–3967, 2023b.
Sankararaman et al. (2019) A. Sankararaman, A. Ganesh, and S. Shakkottai. Social learning in multi agent multi armed bandits. Proc. ACM Meas. Anal. Comput. Syst., 3(3), dec 2019.
Scarlett et al. (2017) J. Scarlett, I. Bogunovic, and V. Cevher. Lower Bounds on Regret for Noisy Gaussian Process Bandit Optimization. In Conference on Learning Theory, volume 65, pages 1–20, 2017.
Shahrampour et al. (2017) S. Shahrampour, A. Rakhlin, and A. Jadbabaie. Multi-armed bandits in multi-agent networks. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2786–2790, 2017.
Shi and Shen (2021) C. Shi and C. Shen. Federated Multi-Armed Bandits. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, pages 9603–9611, 2021.
Shi et al. (2021) C. Shi, C. Shen, and J. Yang. Federated Multi-armed Bandits with Personalization, 2021. URL http://arxiv.org/abs/2102.13101.
Srinivas et al. (2010) N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, ICML, pages 1015–1022, 2010.
Vakili et al. (2021a) S. Vakili, N. Bouziani, S. Jalali, A. Bernacchia, and D.-s. Shiu. Optimal order simple regret for Gaussian process bandits. In Proceedings of the 35th Annual Conference on Neural Information Processing Systems, 2021a.
Vakili et al. (2021b) S. Vakili, K. Khezeli, and V. Picheny. On information gain and regret bounds in Gaussian process bandits. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, AISTATS, 2021b.
Vakili et al. (2022) S. Vakili, J. Scarlett, D.-s. Shiu, and A. Bernacchia. Improved convergence rates for sparse approximation methods in kernel-based learning. In Proceedings of the 39th International Conference on Machine Learning, ICML, pages 21960–21983. PMLR, 2022.
Valko et al. (2013) M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini. Finite-time analysis of kernelised contextual bandits. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence, UAI, pages 654–663, 2013.
Wang et al. (2019) Y. Wang, J. Hu, X. Chen, and L. Wang. Distributed Bandit Learning: Near-Optimal Regret with Efficient Communication. In Proceedings of the 7th International Conference on Learning Representations (ICLR), 2019.
Wild et al. (2021) V. Wild, M. Kanagawa, and D. Sejdinovic. Connections and equivalences between the nyström method and sparse variational gaussian processes, 2021.
Zhu et al. (2021) Z. Zhu, J. Zhu, J. Liu, and Y. Liu. Federated Bandit: A Gossiping Approach. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 5(1):1–29, 2021.

Appendix A

A.1 Proof of Theorem 4.1

Proof.

In this section, we provide a detailed proof of Theorem 4.1. For the regret bound, we first bound the regret incurred by DUETS in each epoch $j$ and then sum it across different epochs to obtain a bound on the overall cumulative regret. We first prove the theorem assuming the results from Lemmas 4.2, 4.3 and 4.4 and then separately prove the lemmas.

Consider any epoch $j\geq 1$ and let $R^{(j)}$ denote the regret incurred by DUETS in this epoch. Since the agents take purely exploratory actions by uniformly sampling points from the current set, we have the following crude bound: $R^{(j)}\leq\Delta_{j}\cdot NT_{j}\cdot M_{f}$ , where $\Delta_{j}:=\sup_{x\in\mathcal{X}_{j}}(f(x^{*})-f(x))$ . The term $NT_{j}\cdot M_{f}$ corresponds to the number of points sampled during the epoch, as we sample each connected component of $\mathcal{X}_{j}$ , of which there are at most $M_{f}$ , $NT_{j}$ times. For $j=1$ , we use the trivial bound,

\displaystyle\Delta_{1}=\sup_{x\in\mathcal{X}}(f(x^{*})-f(x))\leq 2\sup_{x\in% \mathcal{X}}f(x)\leq 2B,

which gives us $R^{(1)}\leq 2B\cdot NT_{1}\cdot M_{f}$ . On invoking Lemma 4.2 for $j>1$ we obtain,

	$\displaystyle R^{(j)}$	$\displaystyle\leq\Delta_{j}\cdot NT_{j}\cdot M_{f}$
		$\displaystyle\leq NT_{j}\cdot M_{f}\cdot\left(8\beta(\delta^{\prime})\cdot% \left(\sup_{x\in\mathcal{X}_{j-1}}\sigma_{j-1}(x)\right)+\frac{4B}{T}\right),$

where $\delta^{\prime}=\dfrac{\delta}{{2(\log\log{NT}+4)|\mathcal{U}_{T}|}}$ . Using Lemma 4.3, we can further bound this expression as

	$\displaystyle R^{(j)}$	$\displaystyle\leq\Delta_{j}\cdot NT_{j}\cdot M_{f}$
		$\displaystyle\leq NT_{j}\cdot M_{f}\cdot\left(8\beta(\delta^{\prime})\cdot C_{f,\mathcal{X}}^{1/2}\cdot\sqrt{\frac{\gamma_{NT_{j-1}}}{NT_{j-1}}}+\frac{4B}{T}\right)$
		$\displaystyle\leq M_{f}\cdot\left(8C_{f,\mathcal{X}}^{1/2}\cdot\beta(\delta^{% \prime})\cdot\sqrt{NT\gamma_{NT_{j-1}}}+\frac{4BNT_{j}}{T}\right)$
		$\displaystyle\leq M_{f}\cdot\left(8C_{f,\mathcal{X}}^{1/2}\cdot\beta(\delta^{% \prime})\cdot\sqrt{NT\gamma_{NT}}+\frac{4BNT_{j}}{T}\right).$

In the third line, we used the inequality $\frac{T_{j}}{\sqrt{T_{j-1}}}\leq\sqrt{T}$ , which follows from the definition of $T_{j}$ . In the last line, we used the fact that $\gamma_{m}$ is an increasing function of $m$ . Thus, if $J$ denotes an upper bound on the number of epochs, we can write:

$\displaystyle\sum_{j=1}^{J}R^{(j)}$	$\displaystyle\leq 2BM_{f}\cdot NT_{1}+\sum_{j=2}^{J}M_{f}\cdot\left(8C_{f,% \mathcal{X}}^{1/2}\cdot\beta(\delta^{\prime})\cdot\sqrt{NT\gamma_{NT}}+\frac{4% BNT_{j}}{T}\right)$
	$\displaystyle\leq 2BM_{f}\cdot NT_{1}+J\cdot\left(8C_{f,\mathcal{X}}^{\prime}% \cdot\beta(\delta^{\prime})\cdot\sqrt{NT\gamma_{NT}}\right)+\frac{4BNM_{f}}{T}% \sum_{j=1}^{J}T_{j}$
	$\displaystyle\leq 2BM_{f}\cdot NT_{1}+J\cdot\left(8C_{f,\mathcal{X}}^{\prime}% \cdot\beta(\delta^{\prime})\cdot\sqrt{NT\gamma_{NT}}\right)+{4BNM_{f}}.$	(10)

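Here, $C_{f,\mathcal{X}}^{\prime}$ denotes a constant that absorbs $M_{f}$ and $C_{f,\mathcal{X}}^{1/2}$ . We next optimize the length of the first epoch $T_{1}$ in order to achieve order-optimal regret. DUETS achieves order-optimal regret for $N\leq\max(T,\gamma_{NT})$ .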

If $N<T$ , we can choose $T_{1}=\sqrt{\frac{T}{N}}+\overline{M}(\delta^{\prime})$ , where $\delta^{\prime}=\frac{\delta}{2(\log\log{NT}+4)}$ . With this choice, the bound in Eqn. (10) becomes $\widetilde{\mathcal{O}}(\sqrt{NT\gamma_{NT}}\beta(\delta^{\prime}))\equiv\widetilde{\mathcal{O}}\left(\sqrt{NT\gamma_{NT}}\left(\log{\frac{T}{\delta}}\right)\right)$ .
If $N\leq\gamma_{NT}$ , we can fix $T_{1}=\sqrt{T}+\overline{M}(\delta^{\prime})$ . We then have $NT_{1}\leq\widetilde{\mathcal{O}}(\sqrt{NT\gamma_{NT}})$ and the bound in Eqn. (10) is once again $\widetilde{\mathcal{O}}\left(\sqrt{NT\gamma_{NT}}\left(\log{\frac{T}{\delta}}\right)\right)$ . ∎

Note that by Lemma 4.4, $J$ is upper bounded by $\log(\log{N}\log{T})+4$ and is thus $\widetilde{O}(1)$ .

Before moving on to the proofs of Lemmas 4.2 and 4.4, we state a definition and two auxiliary lemmas that will be useful for our analysis.

Definition A.1.

Let $\mathcal{D}=\{x_{1},x_{2},\dots,x_{m}\}\subset\mathcal{X}$ be a collection of $m$ points and $\mathcal{S}$ be any subset of $\mathcal{D}$ . Let $\sigma_{\mathcal{D}}^{2}(\cdot)$ denote the posterior variance corresponding to the points in $\mathcal{D}$ and $\tilde{\sigma}_{\mathcal{S}}^{2}(\cdot)$ denote the approximate posterior variance computed based on the points in $\mathcal{S}$ . We call $\mathcal{S}$ an $\varepsilon$ -accurate inducing set if the following relations hold for all $x\in\mathcal{X}$ :

\displaystyle\frac{1-\varepsilon}{1+\varepsilon}\cdot\tilde{\sigma}_{\mathcal{% S}}^{2}(x)\leq\sigma_{\mathcal{D}}^{2}(x)\leq\frac{1+\varepsilon}{1-% \varepsilon}\cdot\tilde{\sigma}_{\mathcal{S}}^{2}(x).
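The two-sided condition in Definition A.1 is easy to test numerically once both variances are available. The following is a hedged sketch: the function name is hypothetical, and the check runs over a finite list of evaluation points, whereas the definition requires the relation for every $x\in\mathcal{X}$ .

```python
def is_eps_accurate(exact_vars, approx_vars, eps):
    """Check the sandwich relation of Definition A.1 on finitely many points.

    exact_vars[i]  : posterior variance from the full set D at point i
    approx_vars[i] : approximate posterior variance from the subset S at point i
    """
    if not 0.0 <= eps < 1.0:
        raise ValueError("eps must lie in [0, 1)")
    lower = (1.0 - eps) / (1.0 + eps)
    upper = (1.0 + eps) / (1.0 - eps)
    return all(
        lower * approx <= exact <= upper * approx
        for exact, approx in zip(exact_vars, approx_vars)
    )


same = is_eps_accurate([1.0, 0.5], [1.0, 0.5], eps=0.5)
too_large = is_eps_accurate([1.0], [4.0], eps=0.5)
too_small = is_eps_accurate([1.0], [0.3], eps=0.5)
```

For $\varepsilon=0.5$ the two factors are $1/3$ and $3$ , so the approximate variance may differ from the exact one by at most a factor of three in either direction.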

Lemma A.2 (Adapted from Calandriello et al. (2019)).

Let $\mathcal{D}=\{x_{1},x_{2},\dots,x_{m}\}\subset\mathcal{X}$ be a collection of $m$ points and $\mathcal{S}$ be a random subset of $\mathcal{D}$ constructed by including each point with probability $p$ , independent of other points. Then $\mathcal{S}$ is an $\varepsilon$ -accurate inducing set with probability at least $\displaystyle 1-4m\exp\left(-\frac{3p\varepsilon^{2}}{8\sigma_{\max}^{2}}\right)$ , where $\sigma_{\max}^{2}=\sup_{x\in\mathcal{X}}\sigma_{\mathcal{D}}^{2}(x)$ .

Lemma A.3.

Let DUETS be run with a choice of $p_{0}=72\log(4NT/\delta^{\prime})$ . Then, for all epochs $j\geq 1$ , the global inducing set $\mathcal{S}_{j}$ is $0.5$ -accurate with probability $1-\delta$ .

Proof.

The statement is an immediate consequence of Lemma A.2 with the given choice of parameter $p_{0}$ . ∎
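To see why the choice of $p_{0}$ suffices, one can evaluate the failure probability from Lemma A.2 numerically. The sketch below rests on assumptions that are not spelled out in this section: the inclusion probability is taken to be $p_{j}=p_{0}\sigma_{j,\max}^{2}$ (the form suggested by the proof of Lemma 4.5), $\delta$ is used in place of $\delta^{\prime}$ for simplicity, and the value of `sigma_max_sq` is a toy number.

```python
import math


def inducing_failure_bound(m, p, eps, sigma_max_sq):
    """Failure probability 4 m exp(-3 p eps^2 / (8 sigma_max^2)) from Lemma A.2."""
    return 4.0 * m * math.exp(-3.0 * p * eps ** 2 / (8.0 * sigma_max_sq))


N, T, delta = 10, 50, 0.1
p0 = 72.0 * math.log(4 * N * T / delta)

sigma_max_sq = 0.001          # toy value of the worst-case posterior variance
p_j = p0 * sigma_max_sq       # assumed form of the inclusion probability

# Worst case m = N * T points in an epoch, accuracy level eps = 0.5.
failure = inducing_failure_bound(m=N * T, p=p_j, eps=0.5, sigma_max_sq=sigma_max_sq)
```

Under the assumed form of $p_{j}$ , the exponent equals $-\frac{3\cdot 72\cdot 0.25}{8}\log(4NT/\delta)=-6.75\log(4NT/\delta)$ irrespective of $\sigma_{j,\max}^{2}$ , so the failure probability is at most $4NT\cdot(\delta/(4NT))^{6.75}$ , which is far below $\delta$ .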

We are now ready to prove Lemmas 4.2 and 4.4.

A.2 Proof of Lemma 4.2

Proof.

Consider any epoch $j\geq 2$ and let $x\in\mathcal{X}_{j}$ . Let $\Delta(x):=f(x^{*})-f(x)$ . We will obtain a bound on $\Delta(x)$ for a general $x$ in order to establish the bound on $\Delta_{j}$ . Using the discretization from Assumption 2.3 for $\mathcal{X}_{j}$ , we obtain,

	$\displaystyle\Delta(x)$	$\displaystyle=f(x^{*})-f(x)$
		$\displaystyle\leq f(x^{*})-f([x^{*}]_{\mathcal{U}_{T}})+f([x^{*}]_{\mathcal{U}_{T}})-(f(x)-f([x]_{\mathcal{U}_{T}}))-f([x]_{\mathcal{U}_{T}})$
		$\displaystyle\leq f([x^{*}]_{\mathcal{U}_{T}})-f([x]_{\mathcal{U}_{T}})+\frac{% 2B}{T}.$

Using the result from Salgia et al. (2023b), we obtain the following high probability bound that holds with probability $1-\delta$ :

	$\displaystyle\Delta(x)$	$\displaystyle\leq f([x^{*}]_{\mathcal{U}_{T}})-f([x]_{\mathcal{U}_{T}})+\frac{% 2B}{T}$
		$\displaystyle\leq\tilde{\mu}_{j}([x^{*}]_{\mathcal{U}_{T}})+\beta(\delta^{\prime})\tilde{\sigma}_{j}([x^{*}]_{\mathcal{U}_{T}})-\tilde{\mu}_{j}([x]_{\mathcal{U}_{T}})+\beta(\delta^{\prime})\tilde{\sigma}_{j}([x]_{\mathcal{U}_{T}})+\frac{2B}{T}$
		$\displaystyle\leq\tilde{\mu}_{j}(x^{*})-\tilde{\mu}_{j}(x)+\beta(\delta^{\prime})\tilde{\sigma}_{j}([x^{*}]_{\mathcal{U}_{T}})+\beta(\delta^{\prime})\tilde{\sigma}_{j}([x]_{\mathcal{U}_{T}})+\frac{4B}{T},$

where we again used Assumption 2.3 in the last step. We claim that $x^{*}\in\mathcal{X}_{j-1}$ for all $j\geq 2$ . Assuming this claim is true, we can bound the above expression as

	$\displaystyle\Delta(x)$	$\displaystyle\leq\sup_{x^{\prime}\in\mathcal{X}_{j-1}}\tilde{\mu}_{j}(x^{\prime})-\tilde{\mu}_{j}(x)+\beta(\delta^{\prime})\tilde{\sigma}_{j}([x^{*}]_{\mathcal{U}_{T}})+\beta(\delta^{\prime})\tilde{\sigma}_{j}([x]_{\mathcal{U}_{T}})+\frac{4B}{T}$
		$\displaystyle\leq 2\beta(\delta^{\prime})\sigma_{j,\max}+\beta(\delta^{\prime}% )\tilde{\sigma}_{j}([x^{*}]_{\mathcal{U}_{T}})+\beta(\delta^{\prime})\tilde{% \sigma}_{j}([x]_{\mathcal{U}_{T}})+\frac{4B}{T},$

where we used the update condition (Eqn. (9)) in the second step. Since $\mathcal{S}_{j}$ is $0.5$ -accurate (Lemma A.3), we have $\tilde{\sigma}_{j}^{2}(x)\leq 3\sigma_{j}^{2}(x)\leq 3\sigma_{j,\max}^{2}$ . On plugging this back into the above equation, we obtain,

\displaystyle\Delta(x)\leq 8\beta(\delta^{\prime})\sigma_{j,\max}+\frac{4B}{T}.

The statement of the lemma follows by noting that $\Delta_{j}=\sup_{x\in\mathcal{X}_{j}}\Delta(x)$ .
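The constant $8$ in the final bound is not tight. As a worked check of the last step, written under the assumption that both $[x^{*}]_{\mathcal{U}_{T}}$ and $[x]_{\mathcal{U}_{T}}$ lie in the region over which $\sigma_{j,\max}$ is defined, the $0.5$ -accuracy of $\mathcal{S}_{j}$ gives $\tilde{\sigma}_{j}(\cdot)\leq\sqrt{3}\,\sigma_{j,\max}$ , and hence:

```latex
\begin{align*}
\Delta(x)
  &\leq 2\beta(\delta')\sigma_{j,\max}
     + \beta(\delta')\tilde{\sigma}_{j}([x^{*}]_{\mathcal{U}_{T}})
     + \beta(\delta')\tilde{\sigma}_{j}([x]_{\mathcal{U}_{T}})
     + \frac{4B}{T} \\
  &\leq 2\beta(\delta')\sigma_{j,\max}
     + 2\sqrt{3}\,\beta(\delta')\sigma_{j,\max}
     + \frac{4B}{T} \\
  &= (2+2\sqrt{3})\,\beta(\delta')\sigma_{j,\max} + \frac{4B}{T}
   \;\leq\; 8\beta(\delta')\sigma_{j,\max} + \frac{4B}{T},
\end{align*}
```

since $2+2\sqrt{3}\approx 5.46\leq 8$ .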

We prove our claim that $x^{*}\in\mathcal{X}_{j}$ for all $j\geq 1$ using induction. Clearly, $x^{*}\in\mathcal{X}_{1}=\mathcal{X}$ , by definition. Assume $x^{*}\in\mathcal{X}_{j-1}$ for some $j\geq 2$ . Fix an arbitrary $x\in\mathcal{X}_{j-1}$ . From the confidence bound lemma, we have:

\displaystyle\mu_{j-1}(x)-\mu_{j-1}(x^{*})\leq(f(x)-f(x^{*}))+\beta(\delta^{\prime})(\sigma_{j-1}(x)+\sigma_{j-1}(x^{*}))\leq 2\beta(\delta^{\prime})\sigma_{j-1,\max},

where the second inequality follows as $f(x)\leq f(x^{*})$ . As the inequality holds for all $x\in\mathcal{X}_{j-1}$ , we must have:

\displaystyle\sup_{x\in\mathcal{X}_{j-1}}\mu_{j-1}(x)-\mu_{j-1}(x^{*})\leq 2\beta(\delta^{\prime})\sigma_{j-1,\max},

and thus indeed $x^{*}\in\mathcal{X}_{j}$ .

∎

A.3 Proof of Lemma 4.4

We define $E(s):=\min\{j:T_{j}\geq T/4\ |\ T_{1}=s\}$ . Note that $T_{j}$ is an increasing function of $j$ . Since $T_{E(s)}\geq T/4$ , we can conclude that $E(s)+4$ is an upper bound on the number of epochs. Thus, we focus on bounding $E(s)$ . We first show that $E(s)$ is a non-increasing function of $s$ .

To that effect, we claim that for $j\geq 2$ the epoch lengths satisfy the relation $T_{j}\geq T^{1-2^{-j+1}}\cdot T_{1}^{2^{-j+1}}$ . This relation follows immediately using induction. For the base case, note that $T_{2}\geq T^{1/2}\cdot T_{1}^{1/2}$ , by definition. Assume that the relation holds for $j-1$ . Thus,

\displaystyle T_{j}\geq T^{1/2}\cdot T_{j-1}^{1/2}\geq T^{1/2}\cdot T^{1/2-2^{-(j-1)+1-1}}\cdot T_{1}^{2^{-(j-1)+1-1}}\geq T^{1-2^{-j+1}}\cdot T_{1}^{2^{-j+1}}.

(11)

Since the epoch lengths $T_{j}$ are lower bounded by an increasing function of $T_{1}$ , the number of epochs $E(s)$ is a non-increasing function of $s$ . Since $T_{1}\geq\frac{T}{N}$ , $E\left(\frac{T}{N}\right)$ is an upper bound on the number of epochs for all choices of $T_{1}$ .

Let $j^{*}=\max\{\log(\log(T)),\log(\log(N))\}$ . Using the above relation for $T_{j}$ from Eqn. (11) and the lower bound on $T_{1}$ , we have,

\displaystyle T_{j^{*}}\geq T\cdot{N^{-2^{1-j^{*}}}}=T\cdot\left(2^{-\frac{\log{N}}{2^{j^{*}}}}\right)^{2}\geq T\cdot 2^{-2}.

We can hence conclude that $T_{j^{*}}\geq T/4$ , which implies that $E(T_{1})\leq j^{*}$ for all permissible choices of $T_{1}$ . Consequently, the number of epochs is bounded by $\log(\log(\max\{N,T\}))+4$ .

A.4 Proof of Lemma 4.5

For all epochs $j\geq 1$ , recall that the inducing set is constructed by including each point from $\mathcal{D}_{j}$ with probability $p_{j}$ , independent of other points. Thus, $|\mathcal{S}_{j}|$ is a binomial random variable with parameters $|\mathcal{D}_{j}|=NT_{j}$ and $p_{j}$ . Using the Chernoff bound for Binomial random variables, we can conclude that

\displaystyle\Pr(|\mathcal{S}_{j}|>(1+\varepsilon)NT_{j}p_{j})\leq\exp\left(-% \frac{\varepsilon^{2}NT_{j}p_{j}}{2+\varepsilon}\right).

Invoking the bound with $\varepsilon=2+\log(1/\delta^{\prime})$ , with $\delta^{\prime}=\delta/(\log\log(NT)+4)$ yields that the following relation holds with probability $1-\delta^{\prime}$ :

	$\displaystyle|\mathcal{S}_{j}|$	$\displaystyle\leq(3+\log(1/\delta^{\prime}))\cdot NT_{j}p_{j}$
		$\displaystyle\leq(3+\log(1/\delta^{\prime}))\cdot NT_{j}\cdot p_{0}\sigma_{j,\max}^{2}$
		$\displaystyle\leq(3+\log(1/\delta^{\prime}))\cdot NT_{j}p_{0}\cdot C_{f,\mathcal{X}}\cdot\frac{\gamma_{NT_{j}}}{NT_{j}}$
		$\displaystyle\leq(3+\log(1/\delta^{\prime}))\cdot p_{0}\cdot C_{f,\mathcal{X}}\cdot\gamma_{NT},$

where we used Lemma 4.3 in the third step and monotonicity of $\gamma_{m}$ in the last step. On taking a union bound over all epochs and using the bound on the number of epochs from Lemma 4.4, we conclude that for all epochs $j$ , $|\mathcal{S}_{j}|=\tilde{\mathcal{O}}(\gamma_{NT})$ with probability $1-\delta$ .
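The sampling step and the Chernoff tail used above can be illustrated with a short simulation. This is a sketch under assumptions: the point set, the inclusion probability, the seed and the value of $\delta^{\prime}$ are toy choices, and the helper names are hypothetical.

```python
import math
import random


def build_inducing_set(points, p, rng):
    """Include each point independently with probability p."""
    return [x for x in points if rng.random() < p]


def chernoff_upper_tail(m, p, eps):
    """Bound on Pr(|S| > (1 + eps) m p) for |S| ~ Binomial(m, p)."""
    return math.exp(-(eps ** 2) * m * p / (2.0 + eps))


rng = random.Random(7)              # illustrative seed
points = list(range(500))           # stand-in for the N * T_j query points
p_j = 0.05                          # toy inclusion probability
inducing = build_inducing_set(points, p_j, rng)

delta_prime = 0.05                  # toy per-epoch failure probability
eps = 2.0 + math.log(1.0 / delta_prime)
size_cap = (1.0 + eps) * len(points) * p_j
tail = chernoff_upper_tail(len(points), p_j, eps)
```

With $\varepsilon=2+\log(1/\delta^{\prime})$ the cap $(1+\varepsilon)NT_{j}p_{j}$ matches the factor $3+\log(1/\delta^{\prime})$ in the first line of the derivation, and in this toy instance the tail bound is far below $\delta^{\prime}$ because the expected size $NT_{j}p_{j}$ exceeds one.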