Random Exploration in Bayesian Optimization: Order-Optimal Regret and Computational Efficiency

Sudeep Salgia School of Electrical & Computer Engineering, Cornell University, Ithaca, NY, {ss3827,qz16}@cornell.edu Sattar Vakili MediaTek Research, UK, [email protected] Qing Zhao School of Electrical & Computer Engineering, Cornell University, Ithaca, NY, {ss3827,qz16}@cornell.edu

(Oct 2023; Revised Feb 2024)

Abstract

We consider Bayesian optimization using Gaussian Process models, also referred to as kernel-based bandit optimization. We study the methodology of exploring the domain using random samples drawn from a distribution. We show that this random exploration approach achieves the optimal error rates. Our analysis is based on novel concentration bounds in an infinite dimensional Hilbert space established in this work, which may be of independent interest. We further develop an algorithm based on random exploration with domain shrinking and establish its order-optimal regret guarantees under both noise-free and noisy settings. In the noise-free setting, our analysis closes the existing gap in regret performance and thereby resolves a COLT open problem. The proposed algorithm also enjoys a computational advantage over prevailing methods due to the random exploration that obviates the expensive optimization of a non-convex acquisition function for choosing the query points at each iteration.

1 Introduction

1.1 GP-based Bayesian Optimization

We consider the problem of sequential optimization of an unknown, possibly non-convex, function $f:\mathcal{X}\to\mathbb{R}$ . The learner sequentially chooses a query point $x_{t}\in\mathcal{X}$ at each time $t$ and observes the function value (potentially subject to noise) at $x_{t}$ . The learning objective is to approach a global maximizer $x^{*}$ of the function through a sequence of query points $\{x_{t}\}_{t=1}^{T}$ chosen sequentially in time. In addition to the convergence of $\{x_{t}\}_{t=1}^{T}$ to $x^{*}$ , an online measure of the learning efficiency is the cumulative regret

$$R(T)=\sum_{t=1}^{T}\left[f(x^{*})-f(x_{t})\right]. \tag{1}$$

The above problem finds a wide range of applications including hyperparameter optimization Li et al. (2016), experimental design Greenhill et al. (2020), recommendation systems Vanchinathan et al. (2014) and robotics Lizotte et al. (2007). An approach that has proven to be particularly effective is Bayesian Optimization (BO) using Gaussian Process (GP) models (a.k.a. kernel-based bandit optimization). The unknown objective function $f$ is assumed to live in a Reproducing Kernel Hilbert Space (RKHS) associated with a known kernel. Within the GP-based BO framework, $f$ is viewed as a realization of a Gaussian process over $\mathcal{X}$ . With each new query $x_{t}$ , the learner sharpens the posterior distribution and uses it as a proxy for $f$ for subsequent optimization. We point out that such a Bayesian approach is equally applicable to a frequentist formulation where $f$ is deterministic as considered in this work. In this case, the GP model of $f$ is fictitious and internal to the algorithm.

Under the assumption of noise-free query feedback, BO techniques were used for optimization as early as 1964 Kushner (1964). GP-based BO was popularized through the work of Močkus et al. (1978). Since then, a number of approaches have been developed and analyzed, often under certain conditions on the kernels and functional characteristics around $x^{*}$ (see Sec. 1.3 for a detailed discussion). Surprisingly, despite this long history, the design of an algorithm with guaranteed order-optimal regret performance has remained an open problem, as discussed in Vakili (2022).

GP-based BO under noisy query was studied much more recently, following the pioneering work by Srinivas et al. (2010) where they proposed the celebrated GP-UCB algorithm. Extensive studies since then have fully characterized the achievable learning performance, both in terms of information-theoretic lower bounds Scarlett et al. (2017) and the design of algorithms such as SupKernel-UCB Valko et al. (2013), GP-ThreDS Salgia et al. (2021), BPE Li and Scarlett (2022), and RIPS Camilleri et al. (2021) that achieve the optimal performance.

Under both the noise-free and noisy settings, a key practical concern for GP-based algorithms is their computational cost. The major computational bottleneck of prevailing GP-based algorithms is the maximization of an acquisition function for choosing the query point at each time instant. The acquisition functions are often non-convex and computationally expensive to maximize. To achieve a low regret order, such an optimization often needs to be carried out with increasing accuracy as time goes on, resulting in a high overall computational requirement.

1.2 Main Results

We explore a new design methodology for GP-based BO: an open-loop exploration of the domain using query points sampled at random from an arbitrary probability distribution supported over the domain. We show that this random exploration approach, while simplistic in nature, leads to order-optimal regret guarantees under both noise-free and noisy feedback models, thus closing the long standing regret gap in the noise-free setting. Moreover, the non-adaptive nature of random sampling bypasses the expensive step of optimizing a non-convex acquisition function, offering a computationally efficient solution without sacrificing learning efficiency.

Random exploration, while not new to many problems (see Sec. 1.3), has not been considered or analyzed for GP-based BO. It stands in sharp contrast to the prevailing exploratory query strategy in GP-based BO: the maximum posterior variance (MPV) sampling. Under MPV, the learning algorithm at each time queries the point with the highest posterior variance conditioned on past observations, i.e., a greedy approach to maximal uncertainty reduction. Surprisingly, we show that the simple, non-adaptive scheme of random exploration achieves the same order of predictive performance as MPV sampling, which is known to be order-optimal. In particular, we show that the worst-case posterior variance corresponding to $n$ randomly drawn points is bounded with high probability by $\tilde{\mathcal{O}}(\gamma_{n}/n)$ and $\tilde{\mathcal{O}}(n^{1-\beta})$ under the noisy and noise-free feedback models, respectively, where $\gamma_{n}$ is the maximal information gain from $n$ query points and $\beta>1$ is the order of the polynomial eigendecay of the kernel (see Sec. 2 for their definitions).

A simpler solution is often more demanding when it comes to establishing optimality in performance. The drastically different nature of random exploration from MPV demands different analytical techniques in characterizing its predictive performance. The tightest bound on the worst-case predictive error of MPV sampling, derived in Wenzel et al. (2021), was obtained using the results on scattered data interpolation (i.e., approximating an unknown function using a given set of points) of functions in Sobolev spaces that provide bounds on the worst-case estimation error of the best interpolant based on the fill distance of the given set of points Wendland (2004); Narcowich et al. (2006); Brenner et al. (2008); Arcangéli et al. (2012); Wenzel et al. (2021). Since RKHSs of Matérn kernels are norm-equivalent to Sobolev spaces, these results also immediately translate to estimation errors for function interpolation in RKHSs. The analytical techniques used in these studies require various technical assumptions on the regularity of the function domain and its boundary. These technical assumptions on the function domain present major challenges in incorporating MPV sampling with effective optimization techniques such as domain shrinking/elimination, hindering its potential applicability in designing algorithms with optimal regret. In contrast, in analyzing random exploration, we establish the concentration of the spectrum of the sample covariance operator to that of the true covariance operator that holds universally for all compact domains. The crux of our analysis builds upon a careful treatment of the infinite-dimensional operators to separately ensure the concentration of the initial spectrum (consisting of the larger eigenvalues) and the tail spectrum, which allows us to obtain optimal convergence rate. 
The simplicity of random exploration in its implementation and the generality in its guaranteed predictive performance as established in this work make this exploration strategy an attractive alternative to MPV. We believe that the tools and techniques established here are of independent interest for extending the methodology of random exploration to other problem fields.

Built upon the above key results on random exploration, we develop and analyze a new algorithm for GP-based BO. Referred to as Random Exploration with Domain Shrinking (REDS), this algorithm integrates the exploration strategy of random sampling with the optimization technique of domain shrinking Li and Scarlett (2022); Salgia et al. (2021). Under the noise-free feedback model, we show that REDS incurs a cumulative regret of $\tilde{\mathcal{O}}(\max\{T^{(3-\beta)/2},1\})$, which closes the gap to the known lower bound established in Tuo and Wang (2020) and hence resolves the longstanding open problem. The generality of random exploration, both in terms of the design methodology and the performance guarantee, is the reason behind the optimal regret performance of REDS. In particular, the order-optimal predictive performance of random exploration, which holds universally over all compact domains, enables a seamless integration of this exploration strategy with domain shrinking. Similarly, in the noisy setting, we show that REDS offers a cumulative regret of $\tilde{\mathcal{O}}(\sqrt{T\gamma_{T}})$, which is order-optimal up to logarithmic factors.

The computational advantage of REDS is evident due to the simplicity of random exploration. We further demonstrate this with empirical studies where we compare REDS with BPE Li and Scarlett (2022) and GP-ThreDS Salgia et al. (2021), all offering optimal regret performance. GP-ThreDS was shown to be computationally more efficient than prevailing algorithms such as GP-UCB. We show that REDS offers a significant speed-up in running time over both algorithms without compromising the regret performance. As shown in Table 1, REDS offers a $\sim 15\times$ and $\sim 100\times$ speed-up in runtime over GP-ThreDS and BPE, respectively.

1.3 Related Work

For GP-based BO with noise-free feedback, a number of algorithms such as GP-EI Močkus (1975), EGO Jones et al. (1998), knowledge-gradient policy Frazier et al. (2008), and GP-PI Kushner (1964); Törn and Žilinskas (1989); Jones (2001) have been proposed, which have since become classical. We refer the reader to the excellent tutorial by Brochu et al. (2010) for a more detailed description of the classical approaches. Despite their good empirical performance and popularity, theoretical guarantees on the convergence of these algorithms have only been established relatively recently. Vazquez and Bect (2010) showed that EI converges almost surely for any function drawn from a GP prior of finite smoothness. Grünewälder et al. (2010) established the convergence rate of a computationally infeasible version of EI. Later, Bull (2011) established convergence rates for the computationally feasible version, showing that GP-EI achieves the optimal simple regret for Matérn kernels with smoothness $\nu<1$, which does not translate to optimal cumulative regret performance. More recently, De Freitas et al. (2012) proposed the Branch and Bound algorithm that achieves a constant cumulative regret in the Bayesian setting under additional assumptions on the differentiability of the kernel and the behaviour around the unique global maximum, which in practice are difficult to verify. In contrast, REDS requires no such additional assumptions and is analyzed in the frequentist setting. Lyu et al. (2020) showed that for kernels with a polynomial eigendecay with parameter $\beta$ (see Definition 2.2), the GP-UCB algorithm achieves a regret of $\mathcal{O}(T^{\frac{1+\beta}{2\beta}})$, which is sub-optimal, as shown in Vakili (2022).

The idea of using random sampling has been explored in related fields. The reconstruction of square integrable functions using random samples is a well-studied problem Bohn and Griebel (2017); Bastian Bohn (2017); Bohn (2018); Smale and Zhou (2004); Cohen et al. (2013); Chkifa et al. (2015); Cohen and Migliorati (2017). In particular, a series of studies considers efficient reconstruction of functions in RKHS using random samples drawn from the domain Kämmerer et al. (2021); Krieg and Ullrich (2021a, b); Moeller and Ullrich (2021). Despite certain similarities in the problem setup, an important point of distinction is that these studies focus on bounding the $L_{2}$ error of the reconstruction. In this work, we focus on bounding the sup-norm (or equivalently, $L_{\infty}$ norm) of the estimation error, which is larger than the $L_{2}$ norm and hence more challenging to bound. Since the analysis of algorithms requires a bound on the sup-norm of the estimation error, existing results are not applicable here.

2 Problem Statement

2.1 RKHS and Mercer’s Theorem

Let $\mathcal{X}$ be a compact subset of $\mathbb{R}^{d}$ and $\varrho$ a finite Borel measure supported on $\mathcal{X}$. A measure $\varrho$ is said to be supported on $\mathcal{X}$ if $\varrho(\mathcal{Y})>0$ for all open sets $\mathcal{Y}\subset\mathcal{X}$. For $\mathcal{X}\subset\mathbb{R}^{d}$, this is equivalent to $\varrho$ being absolutely continuous w.r.t. the Lebesgue measure. Let $L_{2}(\varrho,\mathcal{X})$ denote the Hilbert space of (real) functions defined over $\mathcal{X}$ that are square-integrable w.r.t. $\varrho$. (Footnote 1: To be rigorous, each $f\in L_{2}(\varrho,\mathcal{X})$ represents the class of functions that are equivalent $\varrho$-everywhere.)

Consider a positive definite kernel $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ . A Hilbert space $\mathcal{H}_{k}$ of functions on $\mathcal{X}$ equipped with an inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}_{k}}$ is called a Reproducing Kernel Hilbert Space (RKHS) with reproducing kernel $k$ if the following conditions are satisfied: (i) $\forall\ x\in\mathcal{X}$ , $k(\cdot,x)\in\mathcal{H}_{k}$ ; (ii) $\forall\ x\in\mathcal{X}$ , $\forall\ f\in\mathcal{H}_{k}$ , $f(x)=\langle f,k(\cdot,x)\rangle_{\mathcal{H}_{k}}$ . For simplicity, we use $\psi_{x}$ to denote $k(\cdot,x)$ . The inner product induces the RKHS norm, $\|f\|_{\mathcal{H}_{k}}^{2}=\langle f,f\rangle_{\mathcal{H}_{k}}$ . WLOG, we assume that $k(x,x)=\|\psi_{x}\|_{\mathcal{H}_{k}}^{2}\leq 1$ . For brevity, we drop the subscript of $\mathcal{H}_{k}$ from the inner product for the rest of the paper.

Mercer’s Theorem provides an alternative representation for RKHSs through the eigenvalues and eigenfunctions of a kernel integral operator defined over $L_{2}(\varrho,\mathcal{X})$ using the kernel $k$ .

Theorem 2.1.

(Steinwart and Christmann, 2008, Theorem 4.49) Let $\mathcal{X}$ be a compact metric space, $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ be a continuous kernel and $\varrho$ be a finite Borel measure supported on $\mathcal{X}$. Then, there exists an orthonormal system of functions $\{\varphi_{j}\}_{j\in\mathbb{N}}$ in $L_{2}(\varrho,\mathcal{X})$ and a sequence of non-negative values $\{\lambda_{j}\}_{j\in\mathbb{N}}$ satisfying $\lambda_{1}\geq\lambda_{2}\geq\dots\geq 0$, such that $\displaystyle k(x,x^{\prime})=\sum_{j\in\mathbb{N}}\lambda_{j}\varphi_{j}(x)\varphi_{j}(x^{\prime})$ holds for all $x,x^{\prime}\in\mathcal{X}$ and the convergence is absolute and uniform over $x,x^{\prime}\in\mathcal{X}$. Moreover, $\{(\lambda_{j},\varphi_{j})\}_{j\in\mathbb{N}}$ corresponds to the eigensystem of the kernel integral operator $T_{k}:L_{2}(\varrho)\to L_{2}(\varrho)$ given by $T_{k}f=\int_{\mathcal{X}}k(\cdot,x)f(x)d\varrho(x)$ for all $f\in L_{2}(\varrho)$.

Consequently, the Mercer representation (Steinwart and Christmann, 2008, Thm. 4.51) of the RKHS of $k$ is given as

$$\mathcal{H}_{k}=\left\{f:=\sum_{j\in\mathbb{N}}\alpha_{j}\lambda_{j}^{\frac{1}{2}}\varphi_{j}:\|f\|_{\mathcal{H}_{k}}^{2}=\sum_{j\in\mathbb{N}}\alpha_{j}^{2}<\infty\right\}.$$

This also implies that $\{\upsilon_{j}\}_{j\in\mathbb{N}}$ with $\upsilon_{j}=\sqrt{\lambda_{j}}\varphi_{j}$ is an orthonormal basis for $\mathcal{H}_{k}$ . The following definition characterizes a class of kernels based on their eigendecay profile corresponding to their Mercer representation.

Definition 2.2.

Let $\{\lambda_{j}\}_{j\in\mathbb{N}}$ denote the eigenvalues of a kernel $k$ arranged in the descending order. The kernel $k$ is said to satisfy the polynomial eigendecay condition with a parameter $\beta>1$ if, for some universal constant $C>0$ , we have $\lambda_{j}\leq Cj^{-\beta}$ for all $j\in\mathbb{N}$ .

The above class of kernels encompasses a large number of kernels including the widely used Matérn family. We make the following assumption on the kernel $k$ which is commonly adopted in the literature Vakili et al. (2021b); Chatterji et al. (2019); Riutort-Mayol et al. (2023).

Assumption 2.3.

The eigenfunctions $\{\varphi_{j}\}_{j\in\mathbb{N}}$ corresponding to $k$ are continuous and hence bounded on $\mathcal{X}$ , i.e., there exists $F>0$ such that $\sup_{x\in\mathcal{X}}|\varphi_{j}(x)|\leq F$ for all $j\in\mathbb{N}$ .

2.2 Problem Formulation

We consider the problem of optimizing a fixed and unknown function $f:\mathcal{X}\to\mathbb{R}$, where $\mathcal{X}\subset\mathbb{R}^{d}$ is a compact domain and $f\in\mathcal{H}_{k}$ with $\|f\|_{\mathcal{H}_{k}}\leq B$. A sequential optimization algorithm chooses a point $x_{t}\in\mathcal{X}$ at each time $t$ and observes $y_{t}=f(x_{t})+\varepsilon_{t}$. In the noise-free setting, $\varepsilon_{t}\equiv 0$ for all $t$. For the noisy setting, we assume that $\{\varepsilon_{t}\}_{t=1}^{T}$ are independent, zero-mean, $R$-sub-Gaussian random variables for some fixed constant $R\geq 0$, i.e., $\mathbb{E}[\exp(\zeta\varepsilon_{t})]\leq\exp(\zeta^{2}R^{2}/2)$ for all $\zeta\in\mathbb{R}$ and $t\leq T$. The performance of the sequential algorithm is measured using the notion of cumulative regret, as defined in Eqn. (1).

2.3 Preliminaries on Gaussian Processes

Under the GP model, the unknown function $f$ is treated hypothetically as a realization of $\text{GP}(0,k)$ , a Gaussian Process over $\mathcal{X}$ with zero mean and $k(\cdot,\cdot)$ as the covariance kernel. The noise terms $\varepsilon$ are also viewed as zero mean Gaussian variables with variance $\tau$ . The conjugate property of GPs with Gaussian noise allows for a closed form expression of the posterior distribution. Specifically, let $\mathcal{Z}_{t}=\{(x_{i},y_{i})\}_{i=1}^{t}$ denote a collection of points and their corresponding observations obtained according to the model described in Sec. 2.2. Then, conditioned on $\mathcal{Z}_{t}$ , the posterior distribution of $f$ is also a GP with the following mean and covariance functions:

$$\mu_{t,\tau}(x)=k_{X_{t},x}^{\top}(K_{X_{t},X_{t}}+\tau I_{t})^{-1}Y_{t}, \tag{2}$$
$$k_{t,\tau}(x,\bar{x})=k(x,\bar{x})-k_{X_{t},x}^{\top}(K_{X_{t},X_{t}}+\tau I_{t})^{-1}k_{X_{t},\bar{x}}, \tag{3}$$

where $k_{X_{t},x}=[k(x_{1},x),\dots,k(x_{t},x)]^{\top}$, $Y_{t}=[y_{1},\dots,y_{t}]^{\top}$, $K_{X_{t},X_{t}}=[k(x_{i},x_{j})]_{i,j=1}^{t}$ and $I_{t}$ is the $t\times t$ identity matrix. The posterior variance at a point $x$ is given as $\sigma_{t,\tau}^{2}(x)=k_{t,\tau}(x,x)$. The expressions for the posterior mean and variance in the noise-free setting are obtained by setting $\tau=0$ in the above relations.
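As an illustration of Eqns. (2) and (3), the following is a minimal NumPy sketch of the posterior computation. The squared-exponential kernel, its lengthscale, the function names `rbf_kernel` and `gp_posterior`, and the small `jitter` term added for numerical stability are all assumptions made for this example and are not part of the paper.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.2):
    """Squared-exponential kernel between rows of A (n, d) and B (m, d).

    Illustrative choice only; it satisfies k(x, x) = 1.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_posterior(X, Y, X_query, tau, kernel=rbf_kernel, jitter=1e-10):
    """Posterior mean (Eqn. 2) and posterior variance (diagonal of Eqn. 3).

    X: (t, d) query points, Y: (t,) observations, X_query: (m, d) test points.
    tau = 0 gives the noise-free expressions; jitter is a numerical safeguard.
    """
    X_query = np.asarray(X_query, dtype=float)
    K = kernel(X, X) + (tau + jitter) * np.eye(len(X))
    k_q = kernel(X, X_query)                      # column j is k_{X_t, x_j}
    mean = k_q.T @ np.linalg.solve(K, Y)
    prior_var = np.diag(kernel(X_query, X_query))
    var = prior_var - np.sum(k_q * np.linalg.solve(K, k_q), axis=0)
    return mean, np.maximum(var, 0.0)             # clip tiny negative round-off
```

Linear systems are solved directly instead of forming the matrix inverse, which is the usual numerically safer choice; a Cholesky factorization would serve equally well here.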

The posterior mean and variance computed using the GP model above are powerful tools to predict the values of the unknown function $f$ and to quantify the uncertainty in the prediction. In particular, the prediction error at a point $x\in\mathcal{X}$ , $|f(x)-\mu_{t,\tau}(x)|$ , can be upper bounded by $\alpha\sigma_{t,\tau}(x)$ , for a certain scaling factor $\alpha>0$ that depends on the feedback model Vakili et al. (2021a).

Lastly, we define the information gain of a set of points $X_{n}=\{x_{1},x_{2},\dots,x_{n}\}$ as

$$\tilde{\gamma}_{X_{n},\tau}:=\frac{1}{2}\log\left(\det\left(I_{n}+\tau^{-1}K_{X_{n},X_{n}}\right)\right). \tag{4}$$

Similarly, we define the maximal information gain as $\gamma_{n,\tau}:=\sup_{X_{n}\subset\mathcal{X}^{n}}\tilde{\gamma}_{X_{n},\tau}$ . Maximal information gain is an important term that corresponds to the effective dimension of the kernel and helps characterize the regret of the algorithms. It depends only on the kernel and $\tau$ .
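The information gain of a given set of points can be evaluated directly from its kernel matrix. The sketch below assumes the kernel matrix $K_{X_{n},X_{n}}$ has already been computed; the function name `information_gain` is hypothetical. It uses `numpy.linalg.slogdet` to avoid overflow in the determinant.

```python
import numpy as np

def information_gain(K, tau):
    """Return (1/2) * log det(I_n + K / tau) for an n x n kernel matrix K."""
    n = K.shape[0]
    # slogdet returns (sign, log|det|); the matrix is positive definite, so sign = 1.
    _, logdet = np.linalg.slogdet(np.eye(n) + K / tau)
    return 0.5 * logdet
```

Note that this evaluates $\tilde{\gamma}_{X_{n},\tau}$ for one fixed set only; the maximal information gain $\gamma_{n,\tau}$ requires a supremum over all sets of $n$ points and is not computed here.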

3 The Predictive Performance of Random Exploration

The following theorem characterizes the predictive variance, and consequently the predictive error, of a set of randomly sampled points from the domain.

Theorem 3.1.

Let $\mathcal{X}$ be a compact subset of $\mathbb{R}^{d}$ , $\varrho$ be a finite Borel measure supported on $\mathcal{X}$ , and $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ be a continuous kernel satisfying the polynomial eigendecay condition with parameter $\beta>1$ (Defn. 2.2). Let $X_{n}=\{x_{1},x_{2},\dots,x_{n}\}$ denote a collection of $n$ i.i.d. points drawn from $\mathcal{X}$ according to $\varrho$ . Let $\sigma_{n,0}^{2}$ and $\sigma_{n,\tau}^{2}$ denote, respectively, the posterior variance conditioned on $X_{n}$ in the noise-free setting and the noisy setting with a noise variance of $\tau>0$ . Then, for a given $\delta\in(0,1)$ , there exists a constant $\overline{N}(\delta,k,\varrho,\tau)>0$ , such that, with probability at least $1-\delta$ , for all $n>\overline{N}(\delta,k,\varrho,\tau)$ ,

$$\sup_{x\in\mathcal{X}}\sigma_{n,\tau}^{2}(x)=\mathcal{O}\left(\frac{\tau\gamma_{n,\tau}}{n}\right)=\tilde{\mathcal{O}}\left((n/\tau)^{\frac{1}{\beta}-1}\right),$$
$$\sup_{x\in\mathcal{X}}\sigma_{n,0}^{2}(x)=\tilde{\mathcal{O}}(n^{1-\beta}).$$

The above obtained bounds on the worst-case posterior variance under the random exploration scheme are order-optimal (up to polylogarithmic factors), matching the existing lower bounds Scarlett et al. (2017); Tuo and Wang (2020). The above theorem also improves upon the best known results for noisy scattered data approximation. In particular, for the class of Matérn kernels with smoothness $\nu$ (i.e., $\beta=(2\nu+d)/d$ ), Theorem 3.1 implies a worst-case predictive error of $\tilde{\mathcal{O}}(n^{-\frac{\nu}{2\nu+d}})$ , improving upon the bound of $\tilde{\mathcal{O}}(n^{-\frac{\nu}{2\nu+2d}})$ established by Wynne et al. (2021, Corollary 3).
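For concreteness, the Matérn rate quoted above can be recovered from the noisy bound of Theorem 3.1 by a short calculation. The sketch below treats $\tau$ as a fixed constant and assumes, as discussed in Sec. 2.3, that the prediction error is bounded by the posterior standard deviation up to a scaling factor that is at most polylogarithmic in $n$.

```latex
\begin{align*}
\beta = \frac{2\nu+d}{d}
  &\;\Longrightarrow\;
  \frac{1}{\beta}-1 = \frac{d}{2\nu+d}-1 = -\frac{2\nu}{2\nu+d}, \\
\sup_{x\in\mathcal{X}}\sigma_{n,\tau}^{2}(x)
  = \tilde{\mathcal{O}}\!\left(n^{-\frac{2\nu}{2\nu+d}}\right)
  &\;\Longrightarrow\;
  \sup_{x\in\mathcal{X}}\sigma_{n,\tau}(x)
  = \tilde{\mathcal{O}}\!\left(n^{-\frac{\nu}{2\nu+d}}\right).
\end{align*}
```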

The constant $\overline{N}(\delta,k,\varrho,\tau)$ is related to the kernel $k$ and measure $\varrho$ through two fundamental functions, $N(R)$ and $T(R)$ , which are given as follows for any $R\in\mathbb{N}$ :

$$N(R):=\sup_{x\in\mathcal{X}}\sum_{j=1}^{R}\varphi_{j}^{2}(x),$$
$$T(R):=\sup_{x\in\mathcal{X}}\sum_{j=R+1}^{\infty}\lambda_{j}\varphi_{j}^{2}(x)=\sup_{x\in\mathcal{X}}\sum_{j=R+1}^{\infty}\upsilon_{j}^{2}(x).$$

They are referred to as the spectral functions of the kernel (see Gröchenig (2020) and references therein) because of their dependence on the eigensystem corresponding to the kernel $k$ induced by the measure $\varrho$. Both $N(R)$ and $T(R)$ are fundamental quantities that appear in the analysis of reconstruction and estimation of functions in general $L_{2}$ spaces. The function $N(R)$ corresponds to the inverse of the infimum of the Christoffel function Dunkl and Xu (2014) in the special case of reconstruction using orthogonal polynomials. Under Assumption 2.3 and the condition of polynomial eigendecay (Def. 2.2), $\overline{N}(\delta,k,\varrho,\tau)$ can be shown to be bounded as $\mathcal{O}(\max\{F^{4},(F^{2}/\tau)^{\frac{1}{\beta-1}}\}\log(F/\delta))$. The dependence of $\overline{N}(\delta,k,\varrho,\tau)$ on $\delta$ is mild (logarithmic), as evident from this expression. Lastly, $\overline{N}(\delta,k,\varrho,\tau)$ grows as $\tau$ decreases: by Theorem 3.1, a smaller value of $\tau$ results in a tighter bound on the posterior variance, which in turn requires a larger number of samples. We refer the interested reader to Appendix A for a more detailed discussion of $\overline{N}(\delta,k,\varrho,\tau)$ and its dependence on $N(R)$ and $T(R)$. For brevity, we drop the arguments and use the notation $\overline{N}$ in the rest of the paper.

We provide a sketch of the proof of Theorem 3.1 below and refer the reader to Appendix A for a detailed proof.

Proof.

The main idea of the proof is to relate the worst-case posterior variance conditioned on $X_{n}$ to $\tilde{\gamma}_{X_{n},\tau}$. This relation is established in two parts. In the first part, we establish that as the number of samples grows, the spectrum of the random operator $\hat{\mathbf{Z}}$ concentrates to that of $\mathbf{Z}$, where $\hat{\mathbf{Z}},\mathbf{Z}:\mathcal{H}_{k}\to\mathcal{H}_{k}$ are defined as follows:

$$\hat{\mathbf{Z}}g:=\left[\sum_{i=1}^{n}\langle g,\psi_{x_{i}}\rangle\psi_{x_{i}}\right]+\tau g;\quad\mathbf{Z}:=\mathbb{E}_{X_{n}}[\hat{\mathbf{Z}}],$$

where $\{x_{1},x_{2},\dots,x_{n}\}$ denotes the random ensemble of points drawn according to the measure $\varrho$ . The concentration in spectral norm allows us to approximate the expression of $\sigma_{n,\tau}^{2}(x)=\tau\langle{\psi_{x}},{\hat{\mathbf{Z}}^{-1}\psi_{x}}\rangle$ as $\sigma_{n,\tau}^{2}(x)\approx\tau\langle{\psi_{x}},{{\mathbf{Z}}^{-1}\psi_{x}}\rangle$ , i.e., by replacing the sample covariance operator, $\hat{\mathbf{Z}}$ , with the true covariance operator, $\mathbf{Z}$ . Here, $A^{-1}$ denotes the inverse of an operator $A$ , i.e., $A\circ A^{-1}=A^{-1}\circ A=\mathbf{Id}$ and $\mathbf{Id}$ denotes the identity operator. Thus, this step allows us to obtain a deterministic bound on posterior variance, which is easier to understand and analyze. We establish the required relation using the following two lemmas:

Lemma 3.2.

For all $n\geq\overline{N}$ , the following relation holds with probability $1-\delta/2$ :

$$\|\mathbf{Z}^{-\frac{1}{2}}\hat{\mathbf{Z}}\mathbf{Z}^{-\frac{1}{2}}-\mathbf{Id}\|_{2}\leq 1/9.$$

Lemma 3.3.

If the relation $\|\mathbf{Z}^{-\frac{1}{2}}\hat{\mathbf{Z}}\mathbf{Z}^{-\frac{1}{2}}-\mathbf{Id}\|_{2}\leq b$ holds for some $b\in(0,1/3)$, then the following holds for all $x\in\mathcal{X}$:

$$\langle{\psi_{x}},{\hat{\mathbf{Z}}^{-1}\psi_{x}}\rangle\leq\frac{\sqrt{1-b}}{\sqrt{1-b}-\sqrt{2b}}\cdot\langle{\psi_{x}},{{\mathbf{Z}}^{-1}\psi_{x}}\rangle.$$

Lemma 3.2 forms the cornerstone of the proof of the theorem. The result is established by bounding the expression $|\langle g,(\mathbf{Z}^{-1/2}\hat{\mathbf{Z}}\mathbf{Z}^{-1/2}-\mathbf{Id})g\rangle|$ for an arbitrary $g$ with $\|g\|_{\mathcal{H}_{k}}=1$. We bound the above expression by decomposing it into a sum of three terms. Each of the three terms is then carefully bounded using a combination of the matrix Chernoff inequality (Tropp, 2012, Theorem 1.1), a result for spectral norm concentration based on the non-commutative Khintchine inequality Buchholz (2001, 2005); Moeller and Ullrich (2021), and the Bernstein inequality. Lemma 3.3 is established using a combination of the structure of covariance matrices, the Cauchy-Schwarz inequality and the relation between the operator norm and the $2$-norm. We would like to emphasize that both of the above lemmas hold in general for all eigendecay profiles, even without Assumption 2.3.

In the second part, we show that, with high probability, the information gain of the (random) set $X_{n}$ is lower bounded by $n\cdot\sup_{x\in\mathcal{X}}\langle{\psi_{x}},{{\mathbf{Z}}^{-1}\psi_{x}}\rangle$, up to a multiplicative constant. The above idea is formalized in the following lemma.

Lemma 3.4.

For all $n\geq\overline{N}$ , the following relation holds with probability $1-\delta/2$ :

$$\tilde{\gamma}_{X_{n},\tau}\geq\frac{13}{54F^{2}}\cdot n\cdot\sup_{x\in\mathcal{X}}\langle{\psi_{x}},{{\mathbf{Z}}^{-1}\psi_{x}}\rangle.$$

Thus $\langle{\psi_{x}},{{\mathbf{Z}}^{-1}\psi_{x}}\rangle$ serves as the bridge for connecting the posterior variance to maximal information gain.

The result for the noisy case follows immediately from the above lemmas by noting that $\tilde{\gamma}_{X_{n},\tau}\leq\gamma_{n,\tau}$. For the noise-free setting, the results do not carry forward immediately as the above analysis does not hold for $\tau=0$. To circumvent this issue, we use the fact that $\sigma_{n,\tau}^{2}(x)$ is an increasing function of $\tau$. Thus, we obtain a bound on $\sigma_{n,0}^{2}(x)$ by using the bound on $\sigma_{n,\tau^{*}}^{2}(x)$, where $\tau^{*}$ is a carefully chosen value that not only allows us to use the analysis from the noisy case but also ensures that $\sigma_{n,\tau^{*}}^{2}$ is a close representation of $\sigma_{n,0}^{2}$ to guarantee the tightest possible bounds. ∎

Remark 3.5.

We would like to emphasize that the above result holds for samples generated under every finite Borel measure $\varrho$ supported on $\mathcal{X}$ . However, the quality of the estimate changes with the choice of the measure through the leading constant in the bound in Theorem 3.1.
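The behaviour described in Theorem 3.1 and Remark 3.5 can be explored numerically. The following sketch draws i.i.d. samples on $[0,1]$ from two different measures and tracks the worst-case posterior variance over a fine grid as the number of samples grows. The Matérn-3/2 kernel, its lengthscale, the value of $\tau$, the Beta(2,2) sampling measure, the evaluation grid, and the function names are all assumptions chosen for illustration; the experiment is not taken from the paper.

```python
import numpy as np

def matern32(a, b, lengthscale=0.2):
    """Matern-3/2 kernel on 1-D inputs, normalized so that k(x, x) = 1."""
    r = np.abs(a[:, None] - b[None, :]) / lengthscale
    return (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

def worst_case_variance(X, grid, tau):
    """Maximum over the grid of sigma^2_{n,tau}(x) from Eqn. (3) for samples X."""
    K = matern32(X, X) + tau * np.eye(len(X))
    k_q = matern32(X, grid)
    var = 1.0 - np.sum(k_q * np.linalg.solve(K, k_q), axis=0)
    return float(var.max())

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 401)      # finite proxy for the sup over the domain
tau = 1e-2
samples = {
    "uniform": rng.uniform(0.0, 1.0, size=400),
    "beta(2,2)": rng.beta(2.0, 2.0, size=400),
}
sizes = [25, 50, 100, 200, 400]

# Nested prefixes of one sample path, so each curve is non-increasing in n.
curves = {
    name: [worst_case_variance(X[:n], grid, tau) for n in sizes]
    for name, X in samples.items()
}
for name, values in curves.items():
    print(name, np.round(values, 4))
```

Both measures place positive mass on every open subset of $[0,1]$, so both fall within the scope of the remark; any difference between the two curves reflects the measure-dependent constants rather than a change in the rate.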

4 The REDS Algorithm

In this section, we present the proposed algorithm and analyze its regret performance.

4.1 REDS with Noise-Free Feedback

REDS integrates random exploration with domain shrinking. It proceeds in epochs, maintaining an active region $\mathcal{X}_{r}$ of the domain during each epoch $r\geq 1$. The sequence of active regions $\{\mathcal{X}_{r}\}_{r}$ shrinks across epochs, i.e., $\mathcal{X}_{r}\subseteq\mathcal{X}_{r-1}\subseteq\dots\subseteq\mathcal{X}_{1}=\mathcal{X}$, while ensuring $x^{*}\in\mathcal{X}_{r}$ for all $r$ with high probability. During the $r^{\text{th}}$ epoch, REDS samples $N_{r}$ points uniformly at random from the set $\mathcal{X}_{r}$, where $N_{r}=N_{1}\cdot 2^{r-1}$ and the initial batch size $N_{1}$ is an input to the algorithm. (Footnote 2: If $\mathcal{X}_{r}$ consists of multiple disjoint regions, then we carry out this step for each region separately.)

Using the observations from these points, REDS computes the posterior mean and variance function over $\mathcal{X}_{r}$ , denoted by $\mu_{r}$ and $\sigma^{2}_{r}$ respectively, using the Equations (2) and (3) with $\tau=0$ . The posterior mean and variance are then used to obtain $\mathcal{X}_{r+1}$ , an improved localization of $x^{*}$ , as follows:

$$\mathcal{X}_{r+1}=\left\{x\in\mathcal{X}_{r}\ \bigg|\ \mathrm{UCB}_{r}(x)\geq\sup_{x^{\prime}\in\mathcal{X}_{r}}\mathrm{LCB}_{r}(x^{\prime})\right\}.$$

Here, $\mathrm{UCB}_{r}(x)=\mu_{r}(x)+B\sigma_{r}(x)$ and $\mathrm{LCB}_{r}(x)=\mu_{r}(x)-B\sigma_{r}(x)$ correspond to upper and lower bounds on the estimate of $f$. Pseudocode for the algorithm is provided in Algorithm 1.

Algorithm 1 Random Exploration with Domain Shrinking

1: Input: $N_{1}$ , the initial batch size.
2: Set $\mathcal{X}_{1}\leftarrow\mathcal{X}$ , $t_{\text{curr}}\leftarrow 0$ , $r\leftarrow 1$
3: for $t=t_{\text{curr}}+1,t_{\text{curr}}+2,\dots,t_{\text{curr}}+N_{r}$ do
4:     Sample a point $x_{t}$ uniformly at random from $\mathcal{X}_{r}$ and observe $y_{t}$
5:     if $t>T$ then
6:         Terminate
7:     end if
8: end for
9: Construct $\mu_{r}$ and $\sigma_{r}$ based on observations $\{(x_{t},y_{t}):t\in\{t_{\text{curr}}+1,t_{\text{curr}}+2,\dots,t_{\text{curr}}+N_{r}\}\}$ using Eqn (2) and (3) with $\tau=0$
10: Set $\mathcal{X}_{r+1}=\{x\in\mathcal{X}_{r}\ |\ \mathrm{UCB}_{r}(x)\geq\sup_{x^{\prime}\in\mathcal{X}_{r}}\mathrm{LCB}_{r}(x^{\prime})\}$
11: $t_{\text{curr}}\leftarrow t_{\text{curr}}+N_{r}$ , $N_{r+1}\leftarrow 2N_{r}$
12: $r\leftarrow r+1$
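The epoch structure of Algorithm 1 can be illustrated with a minimal sketch on a finite candidate set. This is not the authors' implementation: it assumes a discretized domain, a Squared Exponential kernel, the standard GP posterior formulas in place of Equations (2) and (3), and a small positive `tau` used as numerical jitter. The function names, the `width` parameter (playing the role of $B$) and the pure-Python linear algebra are all choices made for illustration, and the sketch samples uniformly from the whole active set rather than region by region.

```python
import math
import random


def se_kernel(x, y, length_scale=0.2):
    """Squared Exponential kernel between two points given as tuples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * length_scale ** 2))


def cholesky(M):
    """Lower-triangular L with L L^T = M, for symmetric positive definite M."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = M[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def chol_solve(L, b):
    """Solve (L L^T) z = b by forward and backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (y[i] - sum(L[k][i] * z[k] for k in range(i + 1, n))) / L[i][i]
    return z


def posterior(X, y, queries, tau, kernel):
    """Posterior mean and standard deviation at each query point."""
    K = [[kernel(a, b) + (tau if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    L = cholesky(K)
    weights = chol_solve(L, y)
    result = []
    for q in queries:
        k_q = [kernel(q, a) for a in X]
        mean = sum(k * w for k, w in zip(k_q, weights))
        var = kernel(q, q) - sum(k * v for k, v in zip(k_q, chol_solve(L, k_q)))
        result.append((mean, math.sqrt(max(var, 0.0))))
    return result


def reds(f, candidates, T, N1, width, tau=1e-2, noise_std=0.0, seed=0):
    """Run the epoch loop; returns the final active set and all queried points."""
    rng = random.Random(seed)
    active, queried = list(candidates), []
    batch = N1
    while len(queried) < T:
        n = min(batch, T - len(queried))            # truncate the last epoch at T
        X = [rng.choice(active) for _ in range(n)]  # uniform random exploration
        y = [f(x) + rng.gauss(0.0, noise_std) for x in X]
        queried.extend(X)
        post = posterior(X, y, active, tau, se_kernel)
        best_lcb = max(m - width * s for m, s in post)
        # Keep only the points whose UCB is at least the largest LCB.
        active = [x for x, (m, s) in zip(active, post) if m + width * s >= best_lcb]
        batch *= 2                                  # N_{r+1} = 2 N_r
    return active, queried
```

The point attaining the largest LCB always satisfies the retention rule, so the active set in this sketch never becomes empty. Each epoch uses only the samples drawn in that epoch, mirroring Line 9 of Algorithm 1.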

4.2 REDS under noisy feedback

The REDS algorithm can be extended to operate under noisy feedback with the following two minor modifications to Algorithm 1. First, the posterior mean and variance $(\mu_{r,\tau},\sigma_{r,\tau}^{2})$ in each epoch should be computed using a noise variance $\tau>0$ (Line $9$ of Algorithm 1). Second, the upper and lower confidence bounds, i.e., UCB and LCB (Line $10$ of Algorithm 1), should be updated to the following:

	$\displaystyle\mathrm{UCB}_{r,\tau,\delta}(x)$	$\displaystyle:=\mu_{r,\tau}(x)+\alpha_{\tau,\delta}\sigma_{r,\tau}(x)+c_{T,\tau,\delta}$		(5)
	$\displaystyle\mathrm{LCB}_{r,\tau,\delta}(x)$	$\displaystyle:=\mu_{r,\tau}(x)-\alpha_{\tau,\delta}\sigma_{r,\tau}(x)-c_{T,\tau,\delta},$		(6)

where $\alpha_{\tau,\delta}=B+R\sqrt{(2/\tau)\log(|\mathcal{D}_{T}|/\delta)}$ , $c_{T,\tau,\delta}=\frac{2B}{T}+R\sqrt{\frac{2}{T\tau}\log\left(\frac{4T}{\delta}\right)}$ and $\mathcal{D}_{T}$ is defined in Assumption 4.1.
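To get a feel for the magnitudes involved, the two constants can be evaluated directly from their definitions. The helper below is only an illustrative sketch: the function name and the argument `disc_size` (standing for $|\mathcal{D}_{T}|$) are introduced here, and the numerical inputs in the usage line are arbitrary.

```python
import math


def confidence_parameters(B, R, tau, delta, T, disc_size):
    """Evaluate alpha_{tau,delta} and c_{T,tau,delta} from their definitions."""
    alpha = B + R * math.sqrt((2.0 / tau) * math.log(disc_size / delta))
    c = 2.0 * B / T + R * math.sqrt((2.0 / (T * tau)) * math.log(4.0 * T / delta))
    return alpha, c


# Arbitrary illustrative inputs; disc_size is a hypothetical value of |D_T|.
alpha, c = confidence_parameters(B=1.0, R=0.2, tau=0.2, delta=0.1, T=1000, disc_size=1000)
```

The additive term $c_{T,\tau,\delta}$ decays as $T$ grows, whereas $\alpha_{\tau,\delta}$ grows only logarithmically with the size of the discretization.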

4.3 Performance Analysis

For the analysis of the REDS algorithm, we need to make the following two additional assumptions.

Assumption 4.1.

For all $n\in\mathbb{N}$ , there exists a discretization $\mathcal{D}_{n}$ of $\mathcal{X}$ such that for all $f\in\mathcal{H}_{k}$ , $|f(x)-f([x]_{\mathcal{D}_{n}})|\leq\|f\|_{\mathcal{H}_{k}}/n$ and $|\mathcal{D}_{n}|=\text{poly}(n)$ , where $[x]_{\mathcal{D}_{n}}=\operatorname*{arg\,min}_{y\in\mathcal{D}_{n}}\|x-y\|_{2}$ is the point in $\mathcal{D}_{n}$ that is closest to $x$ . Here, the notation $f(x)=\mathrm{poly}(x)$ is equivalent to $f(x)=\mathcal{O}(x^{k})$ for some $k\in\mathbb{N}$ .

Assumption 4.2.

Let $\mathcal{L}_{\eta}=\{x\in\mathcal{X}\,|\,f(x)\geq\eta\}$ denote the level set of $f$ for $\eta\in[-B,B]$ . We assume that for all $\eta\in[-B,B]$ , $\mathcal{L}_{\eta}$ is a disjoint union of at most $M_{f}<\infty$ components, each of which is closed and connected. Moreover, there exists a bi-Lipschitzian map between each such component and $\mathcal{X}$ with normalized Lipschitz constant pair $L_{f},L_{f}^{\prime}<\infty$ . (We refer the reader to the supplementary material for additional details about the terms used in this assumption.)

Assumption 4.1 is only required for the noisy case and is standard in the literature. The existence of such a discretization has been justified and adopted in previous studies Srinivas et al. (2010); Chowdhury and Gopalan (2017); Vakili et al. (2021a); Salgia et al. (2022) and is a mild assumption on the kernel. Specifically, popular kernels such as the Squared Exponential and Matérn kernels are known to be Lipschitz continuous, in which case an $\varepsilon$ -cover of the domain with $\varepsilon=\mathcal{O}(1/n)$ is sufficient to show the existence of such a discretization. Assumption 4.2 concerns the regularity of the level sets of the function $f$ . The existence of a bi-Lipschitzian map between two sets implies topological similarity between the two sets. Intuitively, this assumption ensures that the shape of the level sets is not “too arbitrary”. Note that such an assumption on the level sets of $f$ is relatively mild, as the RKHS endows the function $f$ with smoothness properties that translate to a degree of topological regularity of its level sets Alberti et al. (2011); Lee (2010).
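The $\varepsilon$ -cover remark for Assumption 4.1 can be made concrete on the unit cube. The sketch below assumes $\mathcal{X}=[0,1]^{d}$ and a hypothetical constant `lipschitz` such that $|f(x)-f(y)|\leq\texttt{lipschitz}\cdot\|f\|_{\mathcal{H}_{k}}\|x-y\|_{2}$ ; both the constant and the function names are introduced here for illustration. A uniform grid with $m$ points per axis has covering radius $\sqrt{d}/(2(m-1))$ , so it suffices to pick $m$ with $\sqrt{d}/(2(m-1))\leq 1/(\texttt{lipschitz}\cdot n)$ , which gives $|\mathcal{D}_{n}|=m^{d}=\mathcal{O}(n^{d})$ .

```python
import math


def grid_points_per_axis(n, d, lipschitz):
    """Smallest m >= 2 with sqrt(d) / (2 (m - 1)) <= 1 / (lipschitz * n)."""
    return max(2, math.ceil(1 + lipschitz * n * math.sqrt(d) / 2.0))


def nearest_grid_point(x, m):
    """Closest point to x on the uniform grid {0, 1/(m-1), ..., 1}^d.

    On a product grid, rounding each coordinate independently yields a
    nearest point in Euclidean distance.
    """
    return tuple(round(xi * (m - 1)) / (m - 1) for xi in x)


m = grid_points_per_axis(n=10, d=2, lipschitz=1.0)
projected = nearest_grid_point((0.33, 0.71), m)
```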

The following theorem characterizes the regret performance of REDS under noise-free feedback.

Theorem 4.3.

Assume that the kernel $k$ satisfies the polynomial eigendecay condition with parameter $\beta>1$ and the function $f$ satisfies Assumption 4.2. For a given $\delta\in(0,1)$ , if the REDS algorithm is run with $N_{1}\geq C_{L_{f},L_{f}^{\prime}}\overline{N}(\delta/\log_{2}(T))$ and noise-free feedback, then the regret incurred by REDS satisfies

\displaystyle R(T)=\tilde{\mathcal{O}}(\max\{T^{\frac{3-\beta}{2}},1\})

with probability at least $1-\delta$ . Here, $C_{L_{f},L_{f}^{\prime}}$ is a constant that depends only on $L_{f}$ and $L_{f}^{\prime}$ .

The following is an immediate corollary of the above theorem for the case of Matérn kernels.

Corollary 4.4.

Let $k$ be the Matérn kernel with smoothness $\nu>0$ . For a given $\delta\in(0,1)$ , if the REDS algorithm is run with $N_{1}\geq C_{L_{f},L_{f}^{\prime}}\overline{N}(\delta/\log_{2}(T))$ under noise-free feedback on a function $f\in\mathcal{H}_{k}$ satisfying Assumption 4.2, then the regret incurred by REDS satisfies

\displaystyle R(T)=\begin{cases}\tilde{\mathcal{O}}(T^{1-\nu/d})&\text{ if }\nu<d,\\ \mathcal{O}((\log T)^{5/2})&\text{ if }\nu=d,\\ \mathcal{O}((\log T)^{3/2})&\text{ if }\nu>d\end{cases}

with probability at least $1-\delta$ . Here, $C_{L_{f},L_{f}^{\prime}}$ is a constant that depends only on $L_{f}$ and $L_{f}^{\prime}$ .

This matches the result conjectured in Vakili (2022) up to logarithmic factors, resolving the open problem.
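The polynomial exponents in Corollary 4.4 can be read off from Theorem 4.3 with a short calculation. The sketch below relies on the standard fact that the Matérn kernel with smoothness $\nu$ on a $d$ -dimensional domain satisfies the polynomial eigendecay condition with $\beta=1+2\nu/d$ ; it recovers only the power of $T$ , not the specific logarithmic factors, which come from the proof.

```latex
\begin{align*}
\beta = 1 + \frac{2\nu}{d}
\quad\Longrightarrow\quad
\frac{3-\beta}{2} = \frac{2 - 2\nu/d}{2} = 1 - \frac{\nu}{d},
\qquad\text{so}\qquad
\max\left\{T^{\frac{3-\beta}{2}},\,1\right\} =
\begin{cases}
T^{1-\nu/d} & \text{if } \nu < d \ (\beta < 3),\\
1 & \text{if } \nu \geq d \ (\beta \geq 3).
\end{cases}
\end{align*}
```

In the regime $\nu\geq d$ the polynomial term disappears, which is why only polylogarithmic factors remain in the last two cases of the corollary.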

The following theorem characterizes the regret performance of REDS in the noisy feedback setting.

Theorem 4.5.

Consider the noisy observation model described in Sec. 2.2 and assume that Assumptions 4.1 and 4.2 hold. For a given $\delta\in(0,1)$ , if the REDS algorithm is run with $N_{1}\geq C_{L_{f},L_{f}^{\prime}}\overline{N}(\delta/(2\log_{2}T))$ and UCB and LCB functions as defined in Eqns. (5) and (6) with parameter $\delta^{\prime}=\delta/(2\log_{2}T)$ , then the regret incurred by REDS satisfies

\displaystyle R(T)=\tilde{\mathcal{O}}(\sqrt{T\gamma_{T}}\log(T/\delta))

with probability at least $1-\delta$ .

As shown by the above theorem, REDS achieves order-optimal regret (up to logarithmic factors) even under the noisy feedback model.

The proofs of Theorems 4.3 and 4.5 follow a similar blueprint. A key aspect of both proofs is to ensure that as Theorem 3.1 is invoked across the sets $\{\mathcal{X}_{r}\}_{r\in\mathbb{N}}$ , the leading constant in Theorem 3.1, which has an implicit dependence on the domain through the constant $F$ , remains bounded and is independent of $T$ . The following lemma shows that for all functions $f$ satisfying Assumption 4.2, the leading constant only depends on the function and the initial domain.

Lemma 4.6.

Let $f\in\mathcal{H}_{k}$ be such that Assumption 4.2 holds. Let $\mathcal{X}^{\prime}$ denote a path connected component of any level set of $f$ and $X^{\prime}\subset\mathcal{X}^{\prime}$ be a set of $n$ points drawn uniformly at random from $\mathcal{X}^{\prime}$ . Then for $n\geq C_{L_{f},L_{f}^{\prime}}\overline{N}(\delta)$ , the following relations hold with probability $1-\delta$ :

	$\displaystyle\sup_{x\in\mathcal{X}^{\prime}}\sigma_{X^{\prime},\tau}^{2}(x)$	$\displaystyle\leq C_{L_{f},L_{f}^{\prime}}^{\prime}\cdot F^{2}\tau\cdot\frac{\gamma_{n,\tau}}{n}$
	$\displaystyle\sup_{x\in\mathcal{X}^{\prime}}\sigma_{X^{\prime},0}^{2}(x)$	$\displaystyle\leq C_{L_{f},L_{f}^{\prime}}^{\prime}\cdot F^{2}\cdot n^{1-\beta}$

where $F$ and $\overline{N}(\delta)$ represent, respectively, the constants in Assumption 2.3 and Theorem 3.1 corresponding to the uniform measure on $\mathcal{X}$ , and $C_{L_{f},L_{f}^{\prime}},C_{L_{f},L_{f}^{\prime}}^{\prime}$ are constants that depend only on $L_{f},L_{f}^{\prime}$ .

At a high level, the above lemma ensures that under the regularity condition on the topology of level sets (Assumption 4.2), Theorem 3.1 can be applied across level sets of $f$ by just paying the penalty of a constant that depends only on $f$ . The proof is based on the inclusion of RKHSs over subsets along with a change of measure argument. We refer the reader to Appendix B for a detailed proof of Lemma 4.6 and Theorems 4.3 and 4.5.

5 Empirical Studies

We compare the computational efficiency of REDS against algorithms with order-optimal regret performance, namely BPE (Li and Scarlett, 2022) and GP-ThreDS (Salgia et al., 2021) through an empirical study. We compare the regret performance and the running time of the three algorithms for three commonly used benchmark functions in Bayesian Optimization, namely, Branin (Azimi et al., 2012; Picheny et al., 2013), Hartmann-4D (Picheny et al., 2013) and Hartmann-6D (Picheny et al., 2013). The analytical expressions for the three benchmark functions are given as follows:

• The Branin function, denoted by $B(x_{1},x_{2})$ , is defined over $\mathcal{X}=[0,1]^{2}$ as

\displaystyle B(x_{1},x_{2})=-\frac{1}{51.95}\left(\left(v-\frac{5.1u^{2}}{4\pi^{2}}+\frac{5u}{\pi}-6\right)^{2}+\left(10-\frac{10}{8\pi}\right)\cos(u)-44.81\right),

where $u=15x_{1}-5$ and $v=15x_{2}$ .

• The Hartmann- $4$ D function, denoted by $H_{4}(x_{1},x_{2},x_{3},x_{4})$ , is defined over $\mathcal{X}=[0,1]^{4}$ as

\displaystyle H_{4}(x_{1},x_{2},x_{3},x_{4})=\sum_{i=1}^{4}w_{i}\exp\left(-\sum_{j=1}^{4}A_{ij}(x_{j}-C_{ij})^{2}\right).

• The Hartmann- $6$ D function, denoted by $H_{6}(x_{1},x_{2},x_{3},x_{4},x_{5},x_{6})$ , is defined over $\mathcal{X}=[0,1]^{6}$ as

\displaystyle H_{6}(x_{1},x_{2},x_{3},x_{4},x_{5},x_{6})=\sum_{i=1}^{4}w_{i}\exp\left(-\sum_{j=1}^{6}A_{ij}(x_{j}-C_{ij})^{2}\right).

In the definitions above, $w_{i}$ denotes the $i^{\text{th}}$ element of the vector $w=\begin{pmatrix}1.0&1.2&3.0&3.2\end{pmatrix}^{\top}$ and $A_{ij}$ and $C_{ij}$ refer to the $(i,j)^{\text{th}}$ element of the matrices $A$ and $C$ , defined below:

\displaystyle A=\begin{pmatrix}10&3&17&3.5&1.7&8\\ 0.05&10&17&0.1&8&14\\ 3&3.5&1.7&10&17&8\\ 17&8&0.05&10&0.1&14\end{pmatrix};\quad C=10^{-4}\cdot\begin{pmatrix}1312&1696&5569&124&8283&5886\\ 2329&4135&8307&3736&1004&9991\\ 2348&1451&3522&2883&3047&6650\\ 4047&8828&8732&5743&1091&381\end{pmatrix}
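For readers who wish to reproduce the benchmarks, the three definitions above translate directly into code. The following is a plain-Python transcription of the formulas as stated in this section (sign conventions included, so that all three functions are to be maximized); the function names are chosen here for illustration, and the $4$ D variant simply uses the first four columns of $A$ and $C$ , as the summation limits indicate.

```python
import math

W = [1.0, 1.2, 3.0, 3.2]
A = [
    [10, 3, 17, 3.5, 1.7, 8],
    [0.05, 10, 17, 0.1, 8, 14],
    [3, 3.5, 1.7, 10, 17, 8],
    [17, 8, 0.05, 10, 0.1, 14],
]
C = [[1e-4 * entry for entry in row] for row in [
    [1312, 1696, 5569, 124, 8283, 5886],
    [2329, 4135, 8307, 3736, 1004, 9991],
    [2348, 1451, 3522, 2883, 3047, 6650],
    [4047, 8828, 8732, 5743, 1091, 381],
]]


def branin(x1, x2):
    """Rescaled, negated Branin function on [0, 1]^2."""
    u = 15.0 * x1 - 5.0
    v = 15.0 * x2
    quad = (v - 5.1 * u ** 2 / (4.0 * math.pi ** 2) + 5.0 * u / math.pi - 6.0) ** 2
    return -(quad + (10.0 - 10.0 / (8.0 * math.pi)) * math.cos(u) - 44.81) / 51.95


def hartmann(x):
    """Hartmann function on [0, 1]^d for d = 4 or d = 6, inferred from len(x)."""
    d = len(x)
    return sum(
        W[i] * math.exp(-sum(A[i][j] * (x[j] - C[i][j]) ** 2 for j in range(d)))
        for i in range(4)
    )
```

As a sanity check, evaluating `branin` at $((\pi+5)/15,\ 2.275/15)$ , which corresponds to a minimizer of the standard (unscaled) Branin function, gives approximately $1.047$ , consistent with the bounds $(0.5,1.2)$ on $f(x^{*})$ used in the experiments.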

For BPE and REDS, we consider a discretized version of the domain consisting of $2000$ , $7000$ and $20000$ points chosen uniformly at random from the domain for the Branin, Hartmann- $4$ D and Hartmann- $6$ D functions respectively. We use the exponentially growing epoch schedule for both BPE and REDS as described in Algorithm 1 for a fair comparison. We implement GP-ThreDS as described in Salgia et al. (2021). For each node in the tree, we consider a discretization, chosen uniformly at random, of size $100$ , $200$ and $500$ for the Branin, Hartmann- $4$ D and Hartmann- $6$ D functions respectively. The values of $(a,b)$ (the lower and upper bounds on $f(x^{*})$ ) are set to $(0.5,1.2)$ , $(0,3.8)$ and $(0,3.5)$ for Branin, Hartmann- $4$ D and Hartmann- $6$ D respectively. We set $\tau=0.2$ for all experiments. The value of $\alpha_{\tau}$ is set to $1$ across all experiments, except for BPE with Hartmann- $4$ D and Hartmann- $6$ D for which we set it to $0.75$ . These values were obtained using a grid search over $[0.25,2]$ in steps of $0.25$ . The parameter $N_{1}$ in REDS and BPE was set to $50$ for Branin and $100$ for the Hartmann- $4$ D and Hartmann- $6$ D functions.

Function	BPE	GP-ThreDS	REDS
Branin	$29.84\pm 6.13$	$4.37\pm 0.28$	$\mathbf{0.32}\pm 0.08$
Hartmann-4D	$38.45\pm 3.93$	$7.59\pm 0.54$	$\mathbf{0.47}\pm 0.11$
Hartmann-6D	$119.71\pm 23.75$	$19.33\pm 0.54$	$\mathbf{1.19}\pm 0.08$

Table 1: Time taken (in seconds) by different algorithms across the different benchmark functions.

For all the experiments, we used the Squared Exponential kernel. The length scale was set to $0.2$ for Branin and $1$ for the Hartmann- $4$ D and Hartmann- $6$ D functions. We corrupted the observations with zero-mean Gaussian noise with a standard deviation of $0.2$ . All the algorithms were run for $T=1000$ time steps. We recorded the cumulative regret and time taken by different algorithms for $10$ Monte Carlo runs for each benchmark function.

The regret for the algorithms over different functions is plotted in Figure 1. The shaded region represents error bars of one standard deviation on either side. The running times, with an error bar of one standard deviation, are tabulated in Table 1. As evident from the plots in Figure 1, the regret incurred by REDS is comparable to that of the other algorithms for all benchmark functions. At the same time, REDS offers roughly a $15\times$ speedup over GP-ThreDS and up to a $100\times$ speedup over BPE in terms of runtime (see Table 1), demonstrating the practical benefits of our proposed methodology of random sampling.
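The quoted speedups follow from the mean running times in Table 1. The short computation below reproduces them; it uses only the table's mean values (the standard deviations are ignored), and the dictionary layout and function name are choices made for this illustration.

```python
# Mean running times in seconds, copied from Table 1.
runtime_s = {
    "Branin": {"BPE": 29.84, "GP-ThreDS": 4.37, "REDS": 0.32},
    "Hartmann-4D": {"BPE": 38.45, "GP-ThreDS": 7.59, "REDS": 0.47},
    "Hartmann-6D": {"BPE": 119.71, "GP-ThreDS": 19.33, "REDS": 1.19},
}


def speedups(table):
    """Ratio of each baseline's mean runtime to the mean runtime of REDS."""
    return {
        function: {name: row[name] / row["REDS"] for name in ("BPE", "GP-ThreDS")}
        for function, row in table.items()
    }


ratios = speedups(runtime_s)
```

The resulting ratios lie between roughly $14$ and $16$ for GP-ThreDS and between roughly $80$ and $100$ for BPE across the three benchmarks.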

6 Conclusion

In this work, we studied the methodology of exploring the domain using random samples drawn from a distribution supported on a compact domain. We showed that this non-adaptive approach achieves the optimal order of worst-case predictive error for RKHS functions in both noisy and noise-free feedback settings. The proposed approach offers a simple alternative for designing Bayesian Optimization algorithms, which typically involve choosing points through a computationally expensive step of optimizing a non-convex acquisition function. Based on this methodology, we developed an algorithm that achieves order-optimal regret in both noisy and noise-free settings, resolving a COLT open problem. We demonstrated the computational advantage of the proposed approach through an empirical study, where the proposed algorithm achieved up to a $100\times$ runtime speedup over state-of-the-art algorithms.

References

Alberti et al. (2011) G. Alberti, S. Bianchini, and G. Crippa. Structure of level sets and sard-type properties of lipschitz maps. Annali della Scuola Normale Superiore di Pisa. Classe di Scienze. Serie V, 4, 08 2011. doi: 10.2422/2036-2145.201107_006.
Arcangéli et al. (2012) R. Arcangéli, M. C. López de Silanes, and J. J. Torrens. Extension of sampling inequalities to Sobolev semi-norms of fractional order and derivative data. Numerische Mathematik, 121(3):587–608, 2012. ISSN 0029599X. doi: 10.1007/s00211-011-0439-3.
Azimi et al. (2012) J. Azimi, A. Jalali, and X. Z. Fern. Hybrid batch bayesian optimization. In Proceedings of the 29th International Conference on Machine Learning, ICML, volume 2, pages 1215–1222, 2012. ISBN 9781450312851.
Bastian Bohn (2017) Bastian Bohn. Error analysis of regularized and unregularized least-squares regression on discretized function spaces. PhD thesis, Rheinische Friedrich-Wilhelms-Universität Bonn, 2017. URL https://hdl.handle.net/20.500.11811/7094.
Bohn (2018) B. Bohn. On the convergence rate of sparse grid least squares regression. In Sparse Grids and Applications, pages 19–41. Springer International Publishing, 2018. ISBN 978-3-319-75426-0.
Bohn and Griebel (2017) B. Bohn and M. Griebel. Error estimates for multivariate regression on discretized function spaces. SIAM Journal on Numerical Analysis, 55(4):1843–1866, 2017.
Brenner et al. (2008) S. C. Brenner and L. R. Scott. The mathematical theory of finite element methods, volume 3. Springer, 2008.
Brochu et al. (2010) E. Brochu, V. M. Cora, and N. De Freitas. A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning, 2010.
Buchholz (2001) A. Buchholz. Operator khintchine inequality in non-commutative probability. Mathematische Annalen, 319(1):1–16, 2001.
Buchholz (2005) A. Buchholz. Optimal constants in khintchine type inequalities for fermions, rademachers and q-gaussian operators. Bulletin of The Polish Academy of Sciences Mathematics, 53:315–321, 2005. URL https://api.semanticscholar.org/CorpusID:55683104.
Bull (2011) A. D. Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12:2879–2904, 2011. ISSN 15324435.
Camilleri et al. (2021) R. Camilleri, J. Katz-Samuels, and K. Jamieson. High-Dimensional Experimental Design and Kernel Bandits. In Proceedings of the 38th International Conference on Machine Learning, ICML, 2021. URL https://arxiv.org/abs/2105.05806.
Chatterji et al. (2019) N. Chatterji, A. Pacchiano, and P. Bartlett. Online learning with kernel losses. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 971–980. PMLR, 2019.
Chkifa et al. (2015) A. Chkifa, A. Cohen, G. Migliorati, F. Nobile, and R. Tempone. Discrete least squares polynomial approximation with random evaluations- application to parametric and stochastic elliptic pdes. ESAIM: Mathematical Modelling and Numerical Analysis-Modélisation Mathématique et Analyse Numérique, 49(3):815–837, 2015.
Chowdhury and Gopalan (2017) S. R. Chowdhury and A. Gopalan. On kernelized multi-armed bandits. In Proceedings of the 34th International Conference on Machine Learning, ICML, volume 2, pages 1397–1422, 2017. ISBN 9781510855144.
Cohen and Migliorati (2017) A. Cohen and G. Migliorati. Optimal weighted least-squares methods. The SMAI Journal of Computational Mathematics, 3:181–203, 2017.
Cohen et al. (2013) A. Cohen, M. Davenport, and D. Leviatan. On the stability and accuracy of least squares approximations. Foundations of Computational Mathematics, 13:819–834, 2013.
De Freitas et al. (2012) N. De Freitas, A. J. Smola, and M. Zoghi. Exponential regret bounds for Gaussian process bandits with deterministic observations. In Proceedings of the 29th International Conference on Machine Learning, ICML, volume 2, pages 1743–1750, 2012. ISBN 9781450312851.
Dunkl and Xu (2014) C. F. Dunkl and Y. Xu. Orthogonal Polynomials of Several Variables. Encyclopedia of Mathematics and its Applications. Cambridge University Press, 2 edition, 2014. doi: 10.1017/CBO9781107786134.
Frazier et al. (2008) P. I. Frazier, W. B. Powell, and S. Dayanik. A knowledge-gradient policy for sequential information collection. SIAM Journal on Control and Optimization, 47(5):2410–2439, 2008.
Greenhill et al. (2020) S. Greenhill, S. Rana, S. Gupta, P. Vellanki, and S. Venkatesh. Bayesian optimization for adaptive experimental design: A review. IEEE access, 8:13937–13948, 2020.
Gröchenig (2020) K. Gröchenig. Sampling, marcinkiewicz–zygmund inequalities, approximation, and quadrature rules. Journal of Approximation Theory, 257:105455, 2020.
Grünewälder et al. (2010) S. Grünewälder, J.-Y. Audibert, M. Opper, and J. Shawe-Taylor. Regret bounds for gaussian process bandit problems. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 273–280, 2010.
Jones (2001) D. R. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of global optimization, 21:345–383, 2001.
Jones et al. (1998) D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13:455–492, 1998.
Kämmerer et al. (2021) L. Kämmerer, T. Ullrich, and T. Volkmer. Worst-case recovery guarantees for least squares approximation using random samples. Constructive Approximation, 54(2):295–352, 2021.
Kanagawa et al. (2018) M. Kanagawa, P. Hennig, D. Sejdinovic, and B. K. Sriperumbudur. Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences, 2018.
Krieg and Ullrich (2021a) D. Krieg and M. Ullrich. Function values are enough for l2-approximation. Foundations of Computational Mathematics, 21:1141–1151, 2021a. doi: https://doi.org/10.1007/s10208-020-09481-w.
Krieg and Ullrich (2021b) D. Krieg and M. Ullrich. Function values are enough for l2-approximation: Part ii. Journal of Complexity, 66, 2021b. ISSN 0885-064X. doi: https://doi.org/10.1016/j.jco.2021.101569.
Kushner (1964) H. Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86:97–106, 1964.
Lee (2010) J. Lee. Introduction to Topological Manifolds. Springer, 2010.
Li et al. (2016) L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization, 2016.
Li and Scarlett (2022) Z. Li and J. Scarlett. Gaussian process bandit optimization with few batches. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, AISTATS, 2022.
Lizotte et al. (2007) D. J. Lizotte, T. Wang, M. H. Bowling, and D. Schuurmans. Automatic gait optimization with gaussian process regression. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), volume 7, pages 944–949, 2007.
Lyu et al. (2020) Y. Lyu, Y. Yuan, and I. W. Tsang. Efficient batch black-box optimization with deterministic regret bounds, 2020.
Močkus (1975) J. Močkus. On bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pages 400–404, Berlin, Heidelberg, 1975. Springer Berlin Heidelberg. ISBN 978-3-540-37497-8.
Moeller and Ullrich (2021) M. Moeller and T. Ullrich. L 2-norm sampling discretization and recovery of functions from rkhs with finite trace. Sampling Theory, Signal Processing, and Data Analysis, 19(2):13, 2021.
Močkus et al. (1978) J. Močkus, V. Tiesis, and A. Žilinskas. Towards Global Optimization, volume 2, chapter The application of Bayesian methods for seeking the extremum, pages 117–129. Elsevier, 09 1978. ISBN 0-444-85171-2.
Narcowich et al. (2006) F. J. Narcowich, J. D. Ward, and H. Wendland. Sobolev error estimates and a bernstein inequality for scattered data interpolation via radial basis functions. Constructive Approximation, 24:175–186, 2006.
Ostrowski (1959) A. M. Ostrowski. A quantitative formulation of Sylvester’s law of inertia. Proceedings of the National Academy of Sciences, 45(5):740–744, 1959. doi: 10.1073/pnas.45.5.740. URL https://www.pnas.org/doi/abs/10.1073/pnas.45.5.740.
Picheny et al. (2013) V. Picheny, T. Wagner, and D. Ginsbourger. A benchmark of kriging-based infill criteria for noisy optimization. Structural and Multidisciplinary Optimization, 48(3):607–626, 2013. ISSN 1615147X. doi: 10.1007/s00158-013-0919-4. URL https://link.springer.com/article/10.1007/s00158-013-0919-4.
Riutort-Mayol et al. (2023) G. Riutort-Mayol, P.-C. Bürkner, M. R. Andersen, A. Solin, and A. Vehtari. Practical hilbert space approximate bayesian gaussian processes for probabilistic programming. Statistics and Computing, 33(1):17, 2023.
Rudin (1987) W. Rudin. Real and complex analysis, 3rd ed. McGraw-Hill, Inc., USA, 1987. ISBN 0070542341.
Salgia et al. (2021) S. Salgia, S. Vakili, and Q. Zhao. A domain-shrinking based Bayesian optimization algorithm with order-optimal regret performance. In Proceedings of the 35th Annual Conference on Neural Information Processing Systems, volume 34, 2021.
Salgia et al. (2022) S. Salgia, S. Vakili, and Q. Zhao. Collaborative Learning in Kernel-based Bandits for Distributed Users, 2022.
Scarlett et al. (2017) J. Scarlett, I. Bogunovic, and V. Cevher. Lower Bounds on Regret for Noisy Gaussian Process Bandit Optimization. In Conference on Learning Theory, volume 65, pages 1–20, 2017.
Smale and Zhou (2004) S. Smale and D.-X. Zhou. Shannon sampling and function reconstruction from point values. Bulletin of The American Mathematical Society, 41:279–306, 2004. doi: 10.1090/S0273-0979-04-01025-0.
Srinivas et al. (2010) N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, ICML, pages 1015–1022, 2010. ISBN 9781605589077. doi: 10.1109/TIT.2011.2182033.
Steinwart and Christmann (2008) I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008. doi: https://doi.org/10.1007/978-0-387-77242-4.
Törn and Žilinskas (1989) A. Törn and A. Žilinskas. Global Optimization. Springer Berlin, Heidelberg, 1989.
Tropp (2012) J. A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12:389–434, 2012.
Tuo and Wang (2020) R. Tuo and W. Wang. Kriging prediction with isotropic matérn correlations: Robustness and experimental designs. The Journal of Machine Learning Research, 21(1):7604–7641, 2020.
Vakili (2022) S. Vakili. Open problem: Regret bounds for noise-free kernel-based bandits. In Proceedings of 35th Conference on Learning Theory (COLT), volume 178, pages 5624–5629, 2022.
Vakili et al. (2021a) S. Vakili, N. Bouziani, S. Jalali, A. Bernacchia, and D.-s. Shiu. Optimal order simple regret for Gaussian process bandits. In Proceedings of the 35th Annual Conference on Neural Information Processing Systems, 2021a.
Vakili et al. (2021b) S. Vakili, K. Khezeli, and V. Picheny. On information gain and regret bounds in Gaussian process bandits. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, AISTATS, 2021b.
Valko et al. (2013) M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini. Finite-time analysis of kernelised contextual bandits. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence, UAI, pages 654–663, 2013.
Vanchinathan et al. (2014) H. P. Vanchinathan, I. Nikolic, F. De Bona, and A. Krause. Explore-exploit in top-n recommender systems via gaussian processes. In Proceedings of the 8th ACM Conference on Recommender Systems, pages 225–232, 2014.
Vazquez and Bect (2010) E. Vazquez and J. Bect. Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. Journal of Statistical Planning and Inference, 140(11):3088–3095, 2010. ISSN 0378-3758. doi: https://doi.org/10.1016/j.jspi.2010.04.018.
Wasserman (2008) L. Wasserman. Lecture notes on statistical methods for machine learning, 2008. URL https://www.stat.cmu.edu/~larry/=sml/Concentration.pdf.
Wendland (2004) H. Wendland. Scattered Data Approximation. Cambridge University Press, 2004. doi: 10.1017/CBO9780511617539.
Wenzel et al. (2021) T. Wenzel, G. Santin, and B. Haasdonk. A novel class of stabilized greedy kernel approximation algorithms: Convergence, stability and uniform point distribution. Journal of Approximation Theory, 262, 2021. ISSN 10960430. doi: 10.1016/j.jat.2020.105508.
Wynne et al. (2021) G. Wynne, F.-X. Briol, and M. Girolami. Convergence guarantees for gaussian process means with misspecified likelihoods and smoothness. The Journal of Machine Learning Research, 22(1):5468–5507, 2021.

Appendix A Proof of Theorem 3.1

We begin by setting up some notation that will be used throughout the proof. Throughout the appendix, we will represent the elements of $\mathcal{H}_{k}$ as infinite dimensional vectors and operators over these function spaces as infinite dimensional matrices. We adopt such a convention for ease of presentation while keeping in mind that despite the matrix representation, the actual operation is over elements of $\mathcal{H}_{k}$ . Recall that we defined the sample covariance operator $\hat{\mathbf{Z}}$ for a randomly chosen sample $X_{n}=\{x_{1},x_{2},\dots,x_{n}\}$ and its expected value $\mathbf{Z}=\mathbb{E}[\hat{\mathbf{Z}}]$ as follows for any $g\in\mathcal{H}_{k}$ :

	$\displaystyle\hat{\mathbf{Z}}g$	$\displaystyle:=\left[\sum_{i=1}^{n}\langle g,\psi_{x_{i}}\rangle\psi_{x_{i}}\right]+\tau g$
	$\displaystyle\mathbf{Z}$	$\displaystyle:=\mathbb{E}[\hat{\mathbf{Z}}].$

In the matrix-vector notation, the operators (equivalently, matrices) are given as:

	$\displaystyle\hat{\mathbf{Z}}$	$\displaystyle:=\left(\sum_{i=1}^{n}\psi_{x_{i}}\psi_{x_{i}}^{\top}\right)+\tau\mathbf{Id}$
	$\displaystyle\mathbf{Z}$	$\displaystyle=\mathbb{E}[\hat{\mathbf{Z}}]=\mathbb{E}\left[\sum_{i=1}^{n}\psi_{x_{i}}\psi_{x_{i}}^{\top}\right]+\tau\mathbf{Id}$
		$\displaystyle=n\mathbb{E}[\psi_{x_{1}}\psi_{x_{1}}^{\top}]+\tau\mathbf{Id}=n\boldsymbol{\Lambda}+\tau\mathbf{Id},$

where $\mathbf{Id}$ is the identity matrix (operator) and $\boldsymbol{\Lambda}=\text{diag}(\lambda_{1},\lambda_{2},\dots)$ is the diagonal matrix consisting of the eigenvalues of the kernel $k$ corresponding to the measure $\varrho$ . If we define $\boldsymbol{\Psi}_{n}:=[\psi_{x_{1}},\psi_{x_{2}},\dots,\psi_{x_{n}}]$ , then we can also write $\hat{\mathbf{Z}}=\boldsymbol{\Psi}_{n}\boldsymbol{\Psi}_{n}^{\top}+\tau\mathbf{Id}$ . Consequently, the posterior variance at any point $x\in\mathcal{X}$ is given as:

\displaystyle\sigma_{n,\tau}^{2}(x)=\tau\psi_{x}^{\top}\hat{\mathbf{Z}}^{-1}% \psi_{x}.
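The step from the kernel-matrix form of the posterior variance to the operator form above is a push-through identity. The derivation below is a sketch under the assumption that Equation (3) has the standard form $\sigma_{n,\tau}^{2}(x)=k(x,x)-k_{X_{n}}(x)^{\top}(K_{X_{n}}+\tau I_{n})^{-1}k_{X_{n}}(x)$ with $k(x,x^{\prime})=\psi_{x}^{\top}\psi_{x^{\prime}}$ ; the symbols $K_{X_{n}}=\boldsymbol{\Psi}_{n}^{\top}\boldsymbol{\Psi}_{n}$ and $k_{X_{n}}(x)=\boldsymbol{\Psi}_{n}^{\top}\psi_{x}$ are notation introduced here for illustration.

```latex
\begin{align*}
k(x,x) - k_{X_n}(x)^{\top}(K_{X_n}+\tau I_n)^{-1}k_{X_n}(x)
&= \psi_x^{\top}\left[\mathbf{Id} - \boldsymbol{\Psi}_n\left(\boldsymbol{\Psi}_n^{\top}\boldsymbol{\Psi}_n+\tau I_n\right)^{-1}\boldsymbol{\Psi}_n^{\top}\right]\psi_x \\
&= \psi_x^{\top}\left[\mathbf{Id} - \left(\boldsymbol{\Psi}_n\boldsymbol{\Psi}_n^{\top}+\tau\mathbf{Id}\right)^{-1}\boldsymbol{\Psi}_n\boldsymbol{\Psi}_n^{\top}\right]\psi_x \\
&= \psi_x^{\top}\hat{\mathbf{Z}}^{-1}\left[\hat{\mathbf{Z}} - \boldsymbol{\Psi}_n\boldsymbol{\Psi}_n^{\top}\right]\psi_x
 = \tau\,\psi_x^{\top}\hat{\mathbf{Z}}^{-1}\psi_x .
\end{align*}
```

The second line uses $\boldsymbol{\Psi}_{n}(\boldsymbol{\Psi}_{n}^{\top}\boldsymbol{\Psi}_{n}+\tau I_{n})^{-1}=(\boldsymbol{\Psi}_{n}\boldsymbol{\Psi}_{n}^{\top}+\tau\mathbf{Id})^{-1}\boldsymbol{\Psi}_{n}$ , and the last step uses $\hat{\mathbf{Z}}-\boldsymbol{\Psi}_{n}\boldsymbol{\Psi}_{n}^{\top}=\tau\mathbf{Id}$ .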

For any $R\in\mathbb{N}$ , we define the following two quantities that will be relevant during our analysis:

	$\displaystyle N(R)$	$\displaystyle:=\sup_{x\in\mathcal{X}}\sum_{j=1}^{R}\varphi_{j}^{2}(x),$		(7)
	$\displaystyle T(R)$	$\displaystyle:=\sup_{x\in\mathcal{X}}\sum_{j=R+1}^{\infty}\lambda_{j}\varphi_{j}^{2}(x)=\sup_{x\in\mathcal{X}}\sum_{j=R+1}^{\infty}\upsilon_{j}^{2}(x).$		(8)

Recall that $\{\varphi_{j}\}_{j\in\mathbb{N}}$ are eigenfunctions of the kernel operator and form an orthonormal system in $L_{2}(\varrho,\mathcal{X})$ and $\{\upsilon_{j}\}_{j\in\mathbb{N}}$ form an orthonormal basis for $\mathcal{H}_{k}$ . The term $N(R)$ is often referred to as the spectral function (see Gröchenig (2020) and references therein) and, in the case of orthogonal polynomials, it is the inverse of the infimum of the Christoffel function Dunkl and Xu (2014). Both $N(R)$ and $T(R)$ are fundamental quantities that appear in the analysis of reconstruction and estimation of functions.

Lastly, based on $N(R)$ and $T(R)$ , for a given kernel $k$ , measure $\varrho$ and $\delta\in(0,1)$ , we define the following terms for any $n\in\mathbb{N}$ and $\tau>0$ :

	$\displaystyle\mathcal{R}_{k,\varrho}^{(1)}(n,\tau,\delta)$	$\displaystyle:=\left\{R\in\mathbb{N}:N(R)\leq\frac{n}{1944\log(6n/\delta)}\right\}$
	$\displaystyle\mathcal{R}_{k,\varrho}^{(2)}(n,\tau,\delta)$	$\displaystyle:=\left\{R\in\mathbb{N}:\max\{42T(R),n\lambda_{R+1}\}\log\left(\frac{12}{\delta}\right)\leq\frac{\tau}{27}\right\}$
	$\displaystyle\mathcal{R}_{k,\varrho}(n,\tau,\delta)$	$\displaystyle:=\mathcal{R}_{k,\varrho}^{(1)}(n,\tau,\delta)\cap\mathcal{R}_{k,\varrho}^{(2)}(n,\tau,\delta)$
	$\displaystyle\overline{N}(k,\varrho,\delta,\tau)$	$\displaystyle:=\max\left\{\min\left\{n:\mathcal{R}_{k,\varrho}(n,\tau,\delta)\neq\emptyset\right\},\lceil 729\cdot F^{4}\cdot\log(12/\delta)\rceil\right\}$

The dependence on $k$ and $\varrho$ is implicit through $\{\varphi_{j}\}_{j\in\mathbb{N}}$ and $\{\lambda_{j}\}_{j\in\mathbb{N}}$ used to define $N(R)$ and $T(R)$ . For brevity of notation, going forward, we drop the explicit description of dependence on $k$ and $\varrho$ .
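To illustrate how the sets $\mathcal{R}^{(1)}$ and $\mathcal{R}^{(2)}$ interact, the sketch below checks sufficient conditions for their intersection to be non-empty. It rests on assumptions that are not taken from the definitions above: eigenfunctions uniformly bounded by $F$ (so that $N(R)\leq RF^{2}$ ), eigenvalues $\lambda_{j}\leq Cj^{-\beta}$ , and the resulting tail bound $T(R)\leq F^{2}CR^{1-\beta}/(\beta-1)$ . Because upper bounds replace the exact quantities, the check is conservative, and the doubling search returns only a power-of-two estimate rather than the exact $\overline{N}$ . All function names are hypothetical.

```python
import math


def feasible(n, tau, delta, beta, C=1.0, F=1.0):
    """True if some R passes both sufficient conditions for this n."""
    # Largest R allowed by the first condition, using N(R) <= R * F^2.
    r_max = math.floor(n / (1944.0 * F ** 2 * math.log(6.0 * n / delta)))
    if r_max < 1:
        return False
    # Both bounds in the second condition decrease with R, so test R = r_max.
    tail_bound = F ** 2 * C * r_max ** (1.0 - beta) / (beta - 1.0)
    next_eigenvalue = n * C * (r_max + 1.0) ** (-beta)
    return max(42.0 * tail_bound, next_eigenvalue) * math.log(12.0 / delta) <= tau / 27.0


def n_bar_estimate(tau, delta, beta, C=1.0, F=1.0, n_max=2 ** 80):
    """First power of two passing `feasible`, combined with the F^4 term."""
    n = 2
    while n <= n_max and not feasible(n, tau, delta, beta, C, F):
        n *= 2
    if n > n_max:
        return None
    return max(n, math.ceil(729.0 * F ** 4 * math.log(12.0 / delta)))
```

Under these assumptions the large numerical constants in the definitions translate into a large threshold: with $\tau=0.2$ , $\delta=0.1$ , $\beta=2$ and $C=F=1$ , the check first succeeds for $n$ on the order of $10^{12}$ . This reflects the conservative constants of the analysis rather than a practical requirement on the sample size.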

We are now ready to prove the theorem. We first prove the statement of the theorem, assuming that the lemmas hold, followed by the proofs of the lemmas.

We begin with the result for the noisy case, where $\tau>0$ is fixed (independent of $n$ ). From Lemma 3.2, we know that for $n\geq\overline{N}$ , $\|\mathbf{Z}^{-1/2}\hat{\mathbf{Z}}\mathbf{Z}^{-1/2}-\mathbf{Id}\|_{2}\leq 1/9$ holds with probability $1-\delta$ . Using this result along with Lemma 3.3, we can conclude that $\psi_{x}^{\top}\hat{\mathbf{Z}}^{-1}\psi_{x}\leq 2\psi_{x}^{\top}{\mathbf{Z}}^{-1}\psi_{x}$ holds for all $x$ . Thus, we have,

$\displaystyle\sigma_{n,\tau}^{2}(x)$	$\displaystyle=\tau\psi_{x}^{\top}\hat{\mathbf{Z}}^{-1}\psi_{x}$
	$\displaystyle\leq 2\tau\psi_{x}^{\top}{\mathbf{Z}}^{-1}\psi_{x}$
	$\displaystyle\leq\frac{108F^{2}}{13}\cdot\tau\cdot\frac{\tilde{\gamma}_{X_{n},\tau}}{n}$
	$\displaystyle\leq\frac{108F^{2}}{13}\cdot\tau\cdot\frac{\gamma_{n,\tau}}{n},$	(9)

as required. The third line in the above expression follows from Lemma 3.4. We would like to emphasize that the polynomial eigendecay condition is not necessary to obtain the above relation. It is only necessary to bound the information gain in terms of $n$ . Under the polynomial eigendecay condition with parameter $\beta>1$ , the above equation can also be written as

\displaystyle\sigma_{n,\tau}^{2}(x)\leq C_{0}\cdot\left(\frac{n}{\tau}\right)^{\frac{1}{\beta}-1}\log(n),

where we used the bound on information gain from Vakili et al. (2021b, Corollary 1) and $C_{0}$ is an appropriately chosen constant independent of $n$ and $\tau$ .

We now consider the noise-free case. Since information gain is only defined for $\tau>0$ , we cannot directly extend the analysis as used in the noisy case by substituting $\tau=0$ . To circumvent this issue, we carefully choose $\tau^{*}>0$ , such that $\sigma_{n,\tau^{*}}^{2}$ is a close representation of $\sigma_{n,0}^{2}$ . We choose $\tau^{*}$ to be dependent on $n$ such that $\tau^{*}$ goes to $0$ as $n$ becomes larger. This allows $\sigma_{n,\tau^{*}}^{2}$ to faithfully represent the value of $\sigma_{n,0}^{2}$ over the range of $n$ . Specifically, we choose $\tau^{*}=c^{\prime}n^{1-\beta}(\log(n/\delta))^{\beta}$ for $c^{\prime}\geq C(1944F^{2})^{\beta}$ , where $C$ is the constant in Assumption 2.3. The condition on constant $c^{\prime}$ ensures that $\overline{N}(k,\varrho,\delta,\tau^{*})$ exists. Since all conditions of the analysis for $\tau>0$ (noisy case) are satisfied, we can directly invoke the result for $\tau>0$ . Using the bound on $\sigma_{n,\tau}^{2}$ and the monotonicity of $\sigma_{n,\tau}^{2}$ as a function of $\tau$ , we obtain,

\displaystyle\sigma_{n,0}^{2}(x)\leq\sigma_{n,\tau^{*}}^{2}(x)\leq C_{1}\cdot n% ^{1-\beta}(\log(n/\delta))^{\beta},

where $C_{1}$ is a constant independent of $n$ .
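As a rough sanity check of the exponent bookkeeping above, the following sketch plugs $\tau^{*}$ into the noisy-case rate $(n/\tau)^{\frac{1}{\beta}-1}\log(n)$ and compares the result with $n^{1-\beta}(\log(n/\delta))^{\beta}$ . The function names and the values `c_prime = 1.0`, `beta = 2.0` and `delta = 0.1` are illustrative assumptions and not part of the analysis; the constants $C_{0}$ and $C_{1}$ are dropped.

```python
import math

def tau_star(n, beta, delta, c_prime=1.0):
    # Surrogate noise level: tau* = c' * n^(1 - beta) * (log(n / delta))^beta
    return c_prime * n ** (1.0 - beta) * math.log(n / delta) ** beta

def noisy_rate(n, tau, beta):
    # Noisy-case rate without the constant C_0: (n / tau)^(1/beta - 1) * log(n)
    return (n / tau) ** (1.0 / beta - 1.0) * math.log(n)

def noise_free_rate(n, beta, delta):
    # Target rate without the constant C_1: n^(1 - beta) * (log(n / delta))^beta
    return n ** (1.0 - beta) * math.log(n / delta) ** beta

def rate_ratio(n, beta=2.0, delta=0.1):
    # Ratio of the noisy rate evaluated at tau* to the noise-free target rate
    return noisy_rate(n, tau_star(n, beta, delta), beta) / noise_free_rate(n, beta, delta)
```

Under these assumptions (in particular `c_prime = 1.0`) the ratio simplifies to $\log(n)/\log(n/\delta)$ , which stays below $1$ for $\delta<1$ , in line with $C_{1}$ being independent of $n$ .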

In the following subsections, we prove Lemmas 3.2, 3.3 and 3.4.

A.1 Proof of Lemma 3.2

Since we are interested in bounding the $2$ -norm of the operator $\mathbf{Z}^{-1/2}\hat{\mathbf{Z}}\mathbf{Z}^{-1/2}-\mathbf{Id}$ , we will focus on finding an upper bound on $g^{\top}(\mathbf{Z}^{-1/2}\hat{\mathbf{Z}}\mathbf{Z}^{-1/2}-\mathbf{Id})g$ that holds uniformly for all functions $g$ in the unit ball of the RKHS, i.e., $\{g:\|g\|_{\mathcal{H}_{k}}\leq 1\}$ . The high-level idea is to separately consider the contributions of the components of $g$ that lie in the subspace spanned by the eigenfunctions corresponding to the “large” eigenvalues, i.e., the head of the spectrum, and in the subspace corresponding to the “small” eigenvalues, i.e., the tail of the spectrum.

Throughout the proof, we fix an $R\in\mathcal{R}_{n,\tau}$ . The existence of such an $R$ is guaranteed by the assumption $n>\overline{N}$ . For the analysis, we define two projection operators, $\mathbf{P}$ and $\mathbf{Q}$ . We define $\mathbf{P}$ as the projection operator onto the subspace spanned by $\{\upsilon_{j}\}_{j=1}^{R}$ , i.e., for any $g=\sum_{j\in\mathbb{N}}g_{j}\upsilon_{j}\in\mathcal{H}_{k}$ , $\mathbf{P}g=\sum_{j=1}^{R}g_{j}\upsilon_{j}$ . Note that $\mathbf{P}$ is an orthogonal projection operator. Similarly, we define $\mathbf{Q}=\mathbf{Id}-\mathbf{P}$ .

We also introduce some additional notation for the ease of presentation. We define $\mathbf{L}$ to be the diagonal matrix (operator) whose $j^{\text{th}}$ entry is $\dfrac{\lambda_{j}}{n\lambda_{j}+\tau}$ . Similarly, let $\omega_{i}=\boldsymbol{\Lambda}^{-1/2}\psi_{x_{i}}$ for $i=1,2,\dots,n$ . Using this notation, we can rewrite the matrix $\mathbf{Z}^{-1/2}\hat{\mathbf{Z}}\mathbf{Z}^{-1/2}-\mathbf{Id}$ as

	$\displaystyle\mathbf{Z}^{-1/2}\hat{\mathbf{Z}}\mathbf{Z}^{-1/2}-\mathbf{Id}$	$\displaystyle=\mathbf{Z}^{-1/2}\left(\sum_{i=1}^{n}\psi_{x_{i}}\psi_{x_{i}}^{% \top}+\tau\mathbf{Id}\right)\mathbf{Z}^{-1/2}-\mathbf{Id}$
		$\displaystyle=\sum_{i=1}^{n}(\mathbf{Z}^{-1/2}\psi_{x_{i}})(\mathbf{Z}^{-1/2}% \psi_{x_{i}})^{\top}+\tau\mathbf{Z}^{-1}-\mathbf{Id}$
		$\displaystyle=\sum_{i=1}^{n}(\mathbf{L}^{1/2}\omega_{i})(\mathbf{L}^{1/2}% \omega_{i})^{\top}-n\mathbf{L}.$
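The last step relies on a diagonal identity that can be checked entry by entry. The sketch below assumes $\mathbf{Z}=n\boldsymbol{\Lambda}+\tau\mathbf{Id}$ , which is the form suggested by the definition of $\mathbf{L}$ and the relation $\mathbf{Z}^{-1/2}\psi_{x_{i}}=\mathbf{L}^{1/2}\omega_{i}$ ; the function names and the numerical values used to exercise it are hypothetical.

```python
def diag_entries(lam, n, tau):
    # j-th diagonal entries of Z = n * Lambda + tau * Id (assumed form) and of L
    z = n * lam + tau
    ell = lam / z
    return z, ell

def residual(lam, n, tau):
    # Difference between (tau * Z^{-1} - Id) and (-n * L) on the j-th diagonal entry;
    # the rewriting in the last line of the display requires this to vanish
    z, ell = diag_entries(lam, n, tau)
    return (tau / z - 1.0) - (-n * ell)
```

For instance, `residual(0.3, 50, 0.1)` is zero up to floating-point error, matching $\tau\mathbf{Z}^{-1}-\mathbf{Id}=-n\mathbf{L}$ .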

For any $g\in\mathcal{H}_{k}$ , we have the following decomposition:

$\displaystyle\|g^{\top}(\mathbf{Z}^{-1/2}\hat{\mathbf{Z}}\mathbf{Z}^{-1/2}-% \mathbf{Id})g\|$	$\displaystyle=\|(\mathbf{P}g+\mathbf{Q}g)^{\top}(\mathbf{Z}^{-1/2}\hat{\mathbf{% Z}}\mathbf{Z}^{-1/2}-\mathbf{Id})(\mathbf{P}g+\mathbf{Q}g)\|$
	$\displaystyle\leq\|(\mathbf{P}g)^{\top}(\mathbf{Z}^{-1/2}\hat{\mathbf{Z}}% \mathbf{Z}^{-1/2}-\mathbf{Id})(\mathbf{P}g)\|+\|(\mathbf{Q}g)^{\top}(\mathbf{Z}^% {-1/2}\hat{\mathbf{Z}}\mathbf{Z}^{-1/2}-\mathbf{Id})(\mathbf{Q}g)\|$
	$\displaystyle~{}~{}~{}~{}~{}~{}+\|(\mathbf{P}g)^{\top}(\mathbf{Z}^{-1/2}\hat{\mathbf{Z}}\mathbf{Z}^{-1/2}-\mathbf{Id})(\mathbf{Q}g)+(\mathbf{Q}g)^{\top}(\mathbf{Z}^{-1/2}\hat{\mathbf{Z}}\mathbf{Z}^{-1/2}-\mathbf{Id})(\mathbf{P}g)\|$
	$\displaystyle\leq\underbrace{\|g^{\top}\mathbf{P}(\mathbf{Z}^{-1/2}\hat{\mathbf% {Z}}\mathbf{Z}^{-1/2}-\mathbf{Id})\mathbf{P}g\|}_{:=E_{1}}+\underbrace{\|g^{\top% }\mathbf{Q}(\mathbf{Z}^{-1/2}\hat{\mathbf{Z}}\mathbf{Z}^{-1/2}-\mathbf{Id})% \mathbf{Q}g\|}_{:=E_{2}}$
	$\displaystyle~{}~{}~{}~{}~{}~{}+2\underbrace{\|g^{\top}\mathbf{P}(\mathbf{Z}^{-% 1/2}\hat{\mathbf{Z}}\mathbf{Z}^{-1/2}-\mathbf{Id})\mathbf{Q}g\|}_{:=E_{3}}.$	(10)

We separately bound the terms $E_{1},E_{2}$ and $E_{3}$ , beginning with $E_{1}$ . We have,

$\displaystyle E_{1}$	$\displaystyle=\|g^{\top}\mathbf{P}(\mathbf{Z}^{-1/2}\hat{\mathbf{Z}}\mathbf{Z}^% {-1/2}-\mathbf{Id})\mathbf{P}g\|$
	$\displaystyle=\left\|(\mathbf{P}g)^{\top}\mathbf{P}\left(\sum_{i=1}^{n}(\mathbf{L}^{1/2}\omega_{i})(\mathbf{L}^{1/2}\omega_{i})^{\top}-n\mathbf{L}\right)\mathbf{P}(\mathbf{P}g)\right\|$
	$\displaystyle=\left\|(\mathbf{P}g)^{\top}\left(\sum_{i=1}^{n}(\mathbf{P}\mathbf% {L}^{1/2}\mathbf{P}\omega_{i})(\mathbf{P}\mathbf{L}^{1/2}\mathbf{P}\omega_{i})% ^{\top}-n\mathbf{P}\mathbf{L}\mathbf{P}\right)(\mathbf{P}g)\right\|$
	$\displaystyle=n\left\|(\mathbf{P}g)^{\top}\mathbf{P}\mathbf{L}^{1/2}\mathbf{P}% \left(\frac{1}{n}\sum_{i=1}^{n}(\mathbf{P}\omega_{i})(\mathbf{P}\omega_{i})^{% \top}-\mathbf{P}\right)\mathbf{P}\mathbf{L}^{1/2}\mathbf{P}(\mathbf{P}g)\right\|$
	$\displaystyle\leq n\left\|\left(\frac{1}{n}\sum_{i=1}^{n}(\mathbf{P}\omega_{i})(\mathbf{P}\omega_{i})^{\top}-\mathbf{P}\right)\right\|_{2}\cdot\|\mathbf{P}\mathbf{L}^{1/2}\mathbf{P}(\mathbf{P}g)\|_{\mathcal{H}_{k}}^{2}$
	$\displaystyle\leq n\left\|\left(\frac{1}{n}\sum_{i=1}^{n}(\mathbf{P}\omega_{i})(\mathbf{P}\omega_{i})^{\top}-\mathbf{P}\right)\right\|_{2}\cdot(g^{\top}\mathbf{P}\mathbf{L}\mathbf{P}g)$
	$\displaystyle\leq\left\|\left(\frac{1}{n}\sum_{i=1}^{n}(\mathbf{P}\omega_{i})(\mathbf{P}\omega_{i})^{\top}-\mathbf{P}\right)\right\|_{2}\cdot(n\|\mathbf{L}\|_{2})\cdot\|\mathbf{P}g\|_{\mathcal{H}_{k}}^{2}.$	(11)

In the above equations, we used the fact that for any diagonal matrix $D$ , $\mathbf{P}D=D\mathbf{P}=\mathbf{P}D\mathbf{P}$ and that $\mathbf{P}^{2}=\mathbf{P}$ . Firstly, note that $\|\mathbf{L}\|_{2}=\max_{j\in\mathbb{N}}\lambda_{j}/(n\lambda_{j}+\tau)\leq 1/n$ . Consequently, $n\|\mathbf{L}\|_{2}\leq 1$ . Secondly, to bound the first term on the RHS, we denote $A_{i}:=\mathbf{P}\omega_{i}$ for all $i=1,2,\dots,n$ . We have $\mathbb{E}[A_{i}A_{i}^{\top}]=\mathbf{P}\mathbb{E}[\omega_{i}\omega_{i}^{\top}]\mathbf{P}=\mathbf{P}\boldsymbol{\Lambda}^{-1/2}\mathbb{E}[\psi_{x_{i}}\psi_{x_{i}}^{\top}]\boldsymbol{\Lambda}^{-1/2}\mathbf{P}=\mathbf{P}\boldsymbol{\Lambda}^{-1/2}\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{-1/2}\mathbf{P}=\mathbf{P}$ . Moreover, for every $i$ , only the top $R\times R$ sub-matrix of $A_{i}A_{i}^{\top}$ has non-zero entries, so it suffices to bound the $2$ -norm of that finite sub-matrix in order to bound the first term on the RHS. We use the Matrix-Chernoff inequality (Tropp, 2012, Theorem 1.1) to bound the $2$ -norm of this finite-dimensional sub-matrix.

For all $i=1,2,\dots,n$ , let $[A_{i}]_{R}\in\mathbb{R}^{R}$ denote the $R$ -dimensional vector corresponding to the first $R$ coordinates of $A_{i}$ . Thus, we are interested in applying the Matrix-Chernoff inequality to bound the following expression:

\displaystyle E_{11}:=\left\|\left(\frac{1}{n}\sum_{i=1}^{n}[A_{i}]_{R}[A_{i}]% _{R}^{\top}-I_{R}\right)\right\|_{2},

where $I_{R}$ denotes the $R$ dimensional identity matrix. Here, we used the fact that the relevant $R\times R$ sub-matrix of $\mathbf{P}$ , or equivalently $\mathbb{E}[[A_{1}]_{R}[A_{1}]_{R}^{\top}]$ , corresponds to $I_{R}$ . To invoke the Matrix-Chernoff inequality, we need bounds on the maximum and minimum eigenvalue of $\displaystyle\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}[A_{i}]_{R}[A_{i}]_{R}^{% \top}\right]$ and a bound on $\|[A_{i}]_{R}[A_{i}]_{R}^{\top}/n\|_{2}$ that holds almost surely for all $i=1,2,\dots,n$ . Since $\mathbb{E}[[A_{1}]_{R}[A_{1}]_{R}^{\top}]=I_{R}$ , $\displaystyle\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}[A_{i}]_{R}[A_{i}]_{R}^{% \top}\right]=I_{R}$ implying that both the maximum and minimum eigenvalues are $1$ . For any $i=1,2,\dots,n$ , we have,

\displaystyle\frac{\|[A_{i}]_{R}[A_{i}]_{R}^{\top}\|_{2}}{n}\leq\frac{1}{n}\mathrm{trace}([A_{i}]_{R}[A_{i}]_{R}^{\top})\leq\frac{1}{n}\mathrm{trace}([A_{i}]_{R}^{\top}[A_{i}]_{R})\leq\frac{1}{n}\|\mathbf{P}\omega_{i}\|_{\mathcal{H}_{k}}^{2}\leq\frac{1}{n}\sum_{j=1}^{R}\varphi_{j}^{2}(x_{i})\leq\frac{N(R)}{n}.

On invoking the Matrix-Chernoff inequality with these results, we obtain that the following relation is true with probability at least $1-\delta/6$ :

\displaystyle E_{11}\leq\sqrt{\frac{3N(R)\log(3R/\delta)}{n}}.

(12)

On combining the above bound with Eqn. (11) and noting that $n\|\mathbf{L}\|_{2}\leq 1$ , we can conclude that:

\displaystyle E_{1}\leq\sqrt{\frac{3N(R)\log(3R/\delta)}{n}}\cdot\|\mathbf{P}g% \|_{\mathcal{H}_{k}}^{2}.

(13)

We would like to mention that the above bound is only valid when the RHS in Eqn. (12) is less than $1$ . However, this condition is satisfied by the choice of $n>\overline{N}$ .
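To illustrate the concentration behind Eqn. (12), the following Monte Carlo sketch uses a hypothetical eigensystem that is not taken from the paper: $\varphi_{j}(x)=\sqrt{2}\cos(\pi jx)$ on $[0,1]$ with uniformly drawn points, for which $\mathbb{E}[\varphi_{i}\varphi_{j}]=\delta_{ij}$ and $N(R)=2R$ . It computes the Frobenius norm of the deviation, which upper bounds the $2$ -norm $E_{11}$ , and the right-hand side of Eqn. (12). The sample size, seed and function names are assumptions made for illustration.

```python
import math
import random

def features(x, R):
    # First R coordinates of omega = Lambda^{-1/2} psi_x, i.e. (phi_1(x), ..., phi_R(x)),
    # for the assumed cosine basis phi_j(x) = sqrt(2) * cos(pi * j * x)
    return [math.sqrt(2.0) * math.cos(math.pi * j * x) for j in range(1, R + 1)]

def empirical_deviation(n, R, seed=0):
    # Frobenius norm of (1/n) * sum_i A_i A_i^T - I_R; this upper bounds the 2-norm E_11
    rng = random.Random(seed)
    gram = [[0.0] * R for _ in range(R)]
    for _ in range(n):
        a = features(rng.random(), R)
        for p in range(R):
            for q in range(R):
                gram[p][q] += a[p] * a[q] / n
    return math.sqrt(sum((gram[p][q] - (1.0 if p == q else 0.0)) ** 2
                         for p in range(R) for q in range(R)))

def chernoff_bound(n, R, delta, N_R):
    # Right-hand side of Eqn. (12): sqrt(3 * N(R) * log(3R / delta) / n)
    return math.sqrt(3.0 * N_R * math.log(3.0 * R / delta) / n)
```

With, say, `n = 20000`, `R = 4` and `delta = 0.1`, one would compare `empirical_deviation(20000, 4)` against `chernoff_bound(20000, 4, 0.1, 8.0)`; both shrink at the rate $n^{-1/2}$ as $n$ grows.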

We now consider the second term, $E_{2}$ . We have,

$\displaystyle E_{2}$	$\displaystyle=\|g^{\top}\mathbf{Q}(\mathbf{Z}^{-1/2}\hat{\mathbf{Z}}\mathbf{Z}^% {-1/2}-\mathbf{Id})\mathbf{Q}g\|$
	$\displaystyle=\left\|(\mathbf{Q}g)^{\top}\left(\sum_{i=1}^{n}(\mathbf{Q}\mathbf% {L}^{1/2}\omega_{i})(\mathbf{Q}\mathbf{L}^{1/2}\omega_{i})^{\top}-n\mathbf{Q}% \mathbf{L}\mathbf{Q}\right)(\mathbf{Q}g)\right\|$	(14)
	$\displaystyle\leq n\underbrace{\left\|\left(\frac{1}{n}\sum_{i=1}^{n}(\mathbf{Q}\mathbf{L}^{1/2}\omega_{i})(\mathbf{Q}\mathbf{L}^{1/2}\omega_{i})^{\top}-\mathbf{Q}\mathbf{L}\mathbf{Q}\right)\right\|_{2}}_{:=E_{21}}\cdot\|\mathbf{Q}g\|_{\mathcal{H}_{k}}^{2}.$	(15)

Note that the term $E_{21}$ has a structure similar to that of $E_{11}$ , except that $E_{21}$ involves infinite-dimensional vectors as opposed to finite-dimensional ones. Thus, to bound $E_{21}$ we use a result from Moeller and Ullrich (2021, Proposition 3.8), which is a spectral concentration inequality for infinite-dimensional vectors derived using the non-commutative Khintchine inequality (Buchholz, 2001, 2005; Moeller and Ullrich, 2021). From Proposition $3.8$ in Moeller and Ullrich (2021), we can conclude that the following relation holds with probability at least $1-\delta/6$ :

\displaystyle\left\|\left(\frac{1}{n}\sum_{i=1}^{n}(\mathbf{Q}\mathbf{L}^{1/2}% \omega_{i})(\mathbf{Q}\mathbf{L}^{1/2}\omega_{i})^{\top}-\mathbf{Q}\mathbf{L}% \mathbf{Q}\right)\right\|_{2}\leq\max\left\{\frac{42}{n}\log\left(\frac{12}{% \delta}\right)B_{1},B_{2}\right\},

(16)

where $B_{1}=\max_{i=1,2,\dots,n}\|\mathbf{Q}\mathbf{L}^{1/2}\omega_{i}\|_{\mathcal{H% }_{k}}^{2}$ and $B_{2}=\|\mathbf{Q}\mathbf{L}\mathbf{Q}\|_{2}$ . We can further bound the terms $B_{1}$ and $B_{2}$ as follows.

	$\displaystyle B_{1}$	$\displaystyle=\max_{i=1,2,\dots,n}\|\mathbf{Q}\mathbf{L}^{1/2}\omega_{i}\|_{\mathcal{H}_{k}}^{2}=\max_{i=1,2,\dots,n}\sum_{j={R+1}}^{\infty}\frac{\lambda_{j}}{n\lambda_{j}+\tau}\varphi_{j}^{2}(x_{i})\leq\sup_{x\in\mathcal{X}}\frac{1}{\tau}\sum_{j={R+1}}^{\infty}\lambda_{j}\varphi_{j}^{2}(x)=\frac{T(R)}{\tau},$
	$\displaystyle B_{2}$	$\displaystyle=\|\mathbf{Q}\mathbf{L}\mathbf{Q}\|_{2}=\max_{j\in\mathbb{N},j>R}\frac{\lambda_{j}}{n\lambda_{j}+\tau}\leq\frac{\lambda_{R+1}}{\tau}.$

On plugging this into Eqn. (16), we obtain the following bound on $E_{21}$ .

\displaystyle E_{21}\leq\frac{1}{\tau}\max\left\{\frac{42}{n}\log\left(\frac{12}{\delta}\right)T(R),\lambda_{R+1}\right\}.

(17)

Combining Eqns. (15) and (17) yields

\displaystyle E_{2}\leq\frac{1}{\tau}\max\left\{42\log\left(\frac{12}{\delta}\right)T(R),n\lambda_{R+1}\right\}\|\mathbf{Q}g\|_{\mathcal{H}_{k}}^{2}.

(18)

We now move onto the third term, $E_{3}$ , which contains the cross terms. For brevity of notation, we define $\zeta_{i}:=\mathbf{P}\mathbf{L}^{1/2}\omega_{i}$ and $\xi_{i}:=\mathbf{Q}\mathbf{L}^{1/2}\omega_{i}$ for all $i=1,2,\dots,n$ . Note that $\zeta_{i}^{\top}\xi_{j}=0$ for all $i,j=1,2,\dots,n$ . Since $\mathbf{P}$ and $\mathbf{Q}$ commute with $\mathbf{L}$ , a diagonal matrix, it is straightforward to note that $\mathbf{P}\mathbf{L}\mathbf{Q}=0$ . Using this relation along with the definition of $\{\zeta_{i}\}_{i=1}^{n}$ and $\{\xi_{i}\}_{i=1}^{n}$ , we can rewrite $E_{3}$ as follows:

$\displaystyle E_{3}$	$\displaystyle=\|g^{\top}\mathbf{P}(\mathbf{Z}^{-1/2}\hat{\mathbf{Z}}\mathbf{Z}^% {-1/2}-\mathbf{Id})\mathbf{Q}g\|$
	$\displaystyle=\left\|g^{\top}\mathbf{P}\left(\sum_{i=1}^{n}(\mathbf{L}^{1/2}% \omega_{i})(\mathbf{L}^{1/2}\omega_{i})^{\top}-n\mathbf{L}\right)\mathbf{Q}g\right\|$
	$\displaystyle=\left\|\sum_{i=1}^{n}(g^{\top}\mathbf{P}\mathbf{L}^{1/2}\omega_{i% })(g^{\top}\mathbf{Q}\mathbf{L}^{1/2}\omega_{i})^{\top}\right\|$
	$\displaystyle=\left\|\sum_{i=1}^{n}\underbrace{(g^{\top}\zeta_{i})(g^{\top}\xi_% {i})}_{:=W_{i}}\right\|.$	(19)

We use Bernstein inequality to bound the sum of the random variables $W_{i}$ , for which we need the values of $\mathbb{E}[W_{i}]$ , $\mathbb{E}[W_{i}^{2}]$ and an upper bound on $|W_{i}|$ that holds almost surely. We begin with $\mathbb{E}[W_{i}]$ . We have,

\displaystyle\mathbb{E}[W_{i}]=\mathbb{E}[(g^{\top}\zeta_{i})(g^{\top}\xi_{i})% ]=g^{\top}\mathbb{E}[\zeta_{i}\xi_{i}^{\top}]g=0.

(20)

For an upper bound on $|W_{i}|$ , note that for any $g$ with $\|g\|_{\mathcal{H}_{k}}=1$ , $|W_{i}|$ is maximized for the choice of $g=\psi_{x_{i}}$ . Thus,

$\displaystyle\|W_{i}\|$	$\displaystyle=\|g\|_{\mathcal{H}_{k}}^{2}\left(\frac{g^{\top}\zeta_{i}}{\|g\|_{\mathcal{H}_{k}}}\right)\left(\frac{g^{\top}\xi_{i}}{\|g\|_{\mathcal{H}_{k}}}\right)$
	$\displaystyle\leq\|g\|_{\mathcal{H}_{k}}^{2}(\psi_{x_{i}}^{\top}\zeta_{i})(\psi_{x_{i}}^{\top}\xi_{i})$
	$\displaystyle\leq\|g\|_{\mathcal{H}_{k}}^{2}\|\zeta_{i}\|_{\mathcal{H}_{k}}^{2}\|\xi_{i}\|_{\mathcal{H}_{k}}^{2}$
	$\displaystyle\leq\|g\|_{\mathcal{H}_{k}}^{2}\cdot\left(\sum_{j=1}^{R}\frac{\lambda_{j}}{n\lambda_{j}+\tau}\varphi_{j}^{2}(x_{i})\right)\cdot\left(\sum_{j=R+1}^{\infty}\frac{\lambda_{j}}{n\lambda_{j}+\tau}\varphi_{j}^{2}(x_{i})\right)$
	$\displaystyle\leq\|g\|_{\mathcal{H}_{k}}^{2}\cdot\left(\frac{1}{n}\sum_{j=1}^{R}\varphi_{j}^{2}(x_{i})\right)\cdot\left(\frac{1}{\tau}\sum_{j=R+1}^{\infty}\lambda_{j}\varphi_{j}^{2}(x_{i})\right)$
	$\displaystyle\leq\|g\|_{\mathcal{H}_{k}}^{2}\cdot\frac{N(R)}{n}\cdot\frac{T(R)}{\tau}.$	(21)

From the above expressions, we can also conclude that $|g^{\top}\zeta_{i}|\leq\|g\|_{\mathcal{H}_{k}}\cdot\dfrac{N(R)}{n}$ and $|g^{\top}\xi_{i}|\leq\|g\|_{\mathcal{H}_{k}}\cdot\dfrac{T(R)}{\tau}$ . We use these relations to obtain a bound on $\mathbb{E}[W_{i}^{2}]$ . We have,

$\displaystyle\mathbb{E}[W_{i}^{2}]$	$\displaystyle=\mathbb{E}[(g^{\top}\zeta_{i})^{2}(g^{\top}\xi_{i})^{2}]$
	$\displaystyle\leq\|g\|_{\mathcal{H}_{k}}^{2}\cdot\min\left\{\mathbb{E}\left[(g^{\top}\zeta_{i})^{2}\right]\left(\frac{T(R)}{\tau}\right)^{2},\mathbb{E}\left[(g^{\top}\xi_{i})^{2}\right]\left(\frac{N(R)}{n}\right)^{2}\right\}$
	$\displaystyle\leq\|g\|_{\mathcal{H}_{k}}^{2}\cdot\min\left\{(g^{\top}\mathbf{P}\mathbf{L}\mathbf{P}g)\cdot\left(\frac{T(R)}{\tau}\right)^{2},(g^{\top}\mathbf{Q}\mathbf{L}\mathbf{Q}g)\cdot\left(\frac{N(R)}{n}\right)^{2}\right\}$
	$\displaystyle\leq\|g\|_{\mathcal{H}_{k}}^{2}\cdot\min\left\{\frac{\|\mathbf{P}g\|_{\mathcal{H}_{k}}^{2}}{n}\cdot\left(\frac{T(R)}{\tau}\right)^{2},\frac{\lambda_{R+1}\|\mathbf{Q}g\|_{\mathcal{H}_{k}}^{2}}{\tau}\cdot\left(\frac{N(R)}{n}\right)^{2}\right\}.$	(22)

In the last step, we used the bounds on $\|\mathbf{L}\|_{2}$ and $\|\mathbf{Q}\mathbf{L}\mathbf{Q}\|_{2}$ derived in the earlier part of the proof. Lastly, since $\mathbb{E}[W_{i}]=0$ , $\text{Var}(W_{i})=\mathbb{E}[W_{i}^{2}]$ . On applying the Bernstein inequality (Wasserman, 2008, Lemma 7.37) using the relations from Eqns. (20), (21) and (22), we can conclude that the following relation holds with probability at least $1-\delta/6$ :

$\displaystyle E_{3}$	$\displaystyle=\left\|\sum_{i=1}^{n}(g^{\top}\zeta_{i})(g^{\top}\xi_{i})\right\|$
	$\displaystyle\leq\|g\|_{\mathcal{H}_{k}}\cdot\sqrt{2n\log\left(\frac{6}{\delta}\right)\min\left\{\frac{\|\mathbf{P}g\|_{\mathcal{H}_{k}}^{2}}{n}\cdot\left(\frac{T(R)}{\tau}\right)^{2},\frac{\lambda_{R+1}\|\mathbf{Q}g\|_{\mathcal{H}_{k}}^{2}}{\tau}\cdot\left(\frac{N(R)}{n}\right)^{2}\right\}}$
	$\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+\|g\|_{\mathcal{H}_{k}}^{2}\cdot\frac{2N(R)}{3n}\cdot\frac{T(R)}{\tau}\cdot\log\left(\frac{6}{\delta}\right).$	(23)

On plugging the results from Eqns. (13), (18) and (23) into Eqn. (10), we obtain

	$\displaystyle\|\mathbf{Z}^{-1/2}\hat{\mathbf{Z}}\mathbf{Z}^{-1/2}-\mathbf{Id}\|_{2}$	$\displaystyle=\sup_{g:\|g\|_{\mathcal{H}_{k}}\leq 1}\|g^{\top}(\mathbf{Z}^{-1/2}\hat{\mathbf{Z}}\mathbf{Z}^{-1/2}-\mathbf{Id})g\|$
		$\displaystyle\leq\sup_{g:\|g\|_{\mathcal{H}_{k}}\leq 1}\bigg{[}\sqrt{\frac{3N(R)\log(6R/\delta)}{n}}\|\mathbf{P}g\|_{\mathcal{H}_{k}}^{2}+\frac{1}{\tau}\max\left\{42\log\left(\frac{12}{\delta}\right)T(R),n\lambda_{R+1}\right\}\|\mathbf{Q}g\|_{\mathcal{H}_{k}}^{2}$
		$\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+2\|g\|_{\mathcal{H}_{k}}\sqrt{2n\log\left(\frac{6}{\delta}\right)\min\left\{\frac{\|\mathbf{P}g\|_{\mathcal{H}_{k}}^{2}}{n}\cdot\left(\frac{T(R)}{\tau}\right)^{2},\frac{\lambda_{R+1}\|\mathbf{Q}g\|_{\mathcal{H}_{k}}^{2}}{\tau}\cdot\left(\frac{N(R)}{n}\right)^{2}\right\}}$
		$\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+\|g\|_{\mathcal{H}_{k}}^{2}\cdot\frac{4N(R)}{3n}\cdot\frac{T(R)}{\tau}\cdot\log\left(\frac{6}{\delta}\right)\bigg{]}$
		$\displaystyle\leq\bigg{[}\sqrt{\frac{3N(R)\log(6R/\delta)}{n}}+\frac{1}{\tau}\max\left\{42\log\left(\frac{12}{\delta}\right)T(R),n\lambda_{R+1}\right\}$
		$\displaystyle~{}~{}~{}~{}+2\sqrt{2n\log\left(\frac{6}{\delta}\right)\min\left\{\frac{1}{n}\cdot\left(\frac{T(R)}{\tau}\right)^{2},\frac{\lambda_{R+1}}{\tau}\cdot\left(\frac{N(R)}{n}\right)^{2}\right\}}+\frac{4N(R)T(R)}{3n\tau}\log\left(\frac{6}{\delta}\right)\bigg{]}.$

On plugging in any value of $R\in\mathcal{R}(n,\tau,\delta)$ and using the definition of $\mathcal{R}_{n,\tau,\delta}$ along with the relation $n\geq\overline{N}$ , we can conclude that $\|\mathbf{Z}^{-1/2}\hat{\mathbf{Z}}\mathbf{Z}^{-1/2}-\mathbf{Id}\|_{2}\leq 1/9$ with probability at least $1-\delta/2$ . The overall probability on the bound is obtained using a union bound for the relations on $E_{1}$ , $E_{2}$ and $E_{3}$ .

A.2 Proof of Lemma 3.3

We begin the proof by showing that we can relate $\psi_{x}^{\top}\hat{\mathbf{Z}}^{-1}\psi_{x}$ to $\psi_{x}^{\top}{\mathbf{Z}}^{-1}\psi_{x}$ through the operator norm of $\mathbf{M}:=\hat{\mathbf{Z}}^{-1/2}(\mathbf{Z}-\hat{\mathbf{Z}})\mathbf{Z}^{-1/2}$ . Specifically, we show that if the operator norm of $\mathbf{M}$ is small, then $\psi_{x}^{\top}\hat{\mathbf{Z}}^{-1}\psi_{x}$ and $\psi_{x}^{\top}{\mathbf{Z}}^{-1}\psi_{x}$ are within a constant factor of each other. Lastly, we use the condition on $\|\mathbf{Z}^{-\frac{1}{2}}\hat{\mathbf{Z}}\mathbf{Z}^{-\frac{1}{2}}-\mathbf{Id}\|_{2}$ to bound $\|\mathbf{M}\|_{\text{op}}$ , the operator norm of $\mathbf{M}$ , and obtain the required result.

We begin with considering the following expression.

$\displaystyle\left\|\psi_{x}^{\top}(\hat{\mathbf{Z}}^{-1}-\mathbf{Z}^{-1})\psi_% {x}\right\|$	$\displaystyle=\left\|\psi_{x}^{\top}\hat{\mathbf{Z}}^{-1}(\mathbf{Z}-\hat{% \mathbf{Z}})\mathbf{Z}^{-1}\psi_{x}\right\|$
	$\displaystyle=\left\|\psi_{x}^{\top}\hat{\mathbf{Z}}^{-1/2}\cdot\hat{\mathbf{Z}% }^{-1/2}(\mathbf{Z}-\hat{\mathbf{Z}})\mathbf{Z}^{-1/2}\cdot\mathbf{Z}^{-1/2}% \psi_{x}\right\|$
	$\displaystyle\leq\|\hat{\mathbf{Z}}^{-1/2}\psi_{x}\|_{\mathcal{H}_{k}}\|{\mathbf{Z}}^{-1/2}\psi_{x}\|_{\mathcal{H}_{k}}\|\hat{\mathbf{Z}}^{-1/2}(\mathbf{Z}-\hat{\mathbf{Z}})\mathbf{Z}^{-1/2}\|_{\text{op}}$
	$\displaystyle\leq\sqrt{(\psi_{x}^{\top}\hat{\mathbf{Z}}^{-1}\psi_{x})}\cdot\sqrt{(\psi_{x}^{\top}\mathbf{Z}^{-1}\psi_{x})}\cdot\|\mathbf{M}\|_{\text{op}}.$	(24)

Consider the scenario where the relation $\|\mathbf{M}\|_{\text{op}}\leq c$ is satisfied for some $c\in(0,1)$ . We claim that under this scenario, we have, $\psi_{x}^{\top}\hat{\mathbf{Z}}^{-1}\psi_{x}\leq(1-c)^{-1}\cdot\psi_{x}^{\top}% {\mathbf{Z}}^{-1}\psi_{x}$ . To show this claim, we consider Eqn. (24). If $\psi_{x}^{\top}{\mathbf{Z}}^{-1}\psi_{x}\geq\psi_{x}^{\top}\hat{\mathbf{Z}}^{-% 1}\psi_{x}$ , the claim follows immediately. For the other case, we have,

	$\displaystyle\psi_{x}^{\top}\hat{\mathbf{Z}}^{-1}\psi_{x}-\psi_{x}^{\top}{\mathbf{Z}}^{-1}\psi_{x}$	$\displaystyle\leq\sqrt{(\psi_{x}^{\top}\hat{\mathbf{Z}}^{-1}\psi_{x})}\cdot\sqrt{(\psi_{x}^{\top}\mathbf{Z}^{-1}\psi_{x})}\cdot c$
		$\displaystyle\leq\sqrt{(\psi_{x}^{\top}\hat{\mathbf{Z}}^{-1}\psi_{x})}\cdot\sqrt{(\psi_{x}^{\top}\hat{\mathbf{Z}}^{-1}\psi_{x})}\cdot c$
		$\displaystyle\leq c\cdot(\psi_{x}^{\top}\hat{\mathbf{Z}}^{-1}\psi_{x})$
	$\displaystyle\implies\psi_{x}^{\top}\hat{\mathbf{Z}}^{-1}\psi_{x}$	$\displaystyle\leq\left(\psi_{x}^{\top}{\mathbf{Z}}^{-1}\psi_{x}\right)\cdot\frac{1}{1-c},$

as claimed. Thus, it suffices to show that $\|\mathbf{M}\|_{\text{op}}$ is small.

To that effect, note that we can write the operator $\mathbf{M}$ as $\mathbf{M}=\hat{\mathbf{Z}}^{-1/2}\mathbf{Z}^{1/2}-\hat{\mathbf{Z}}^{1/2}\mathbf{Z}^{-1/2}=\mathbf{C}^{-1}-\mathbf{C}^{\top}$ , where $\mathbf{C}:=\mathbf{Z}^{-1/2}\hat{\mathbf{Z}}^{1/2}$ . Consequently, using the definition of the operator norm yields

$\displaystyle\|\mathbf{M}\|_{\text{op}}^{2}=\|\mathbf{M}^{\top}\mathbf{M}\|_{2}$	$\displaystyle=\|((\mathbf{C}^{\top})^{-1}-\mathbf{C})(\mathbf{C}^{-1}-\mathbf{C}^{\top})\|_{2}$
	$\displaystyle=\|(\mathbf{C}\mathbf{C}^{\top})^{-1}-\text{Id}+\mathbf{C}\mathbf{C}^{\top}-\text{Id}\|_{2}$
	$\displaystyle\leq\|(\mathbf{C}\mathbf{C}^{\top})^{-1}-\text{Id}\|_{2}+\|\mathbf{C}\mathbf{C}^{\top}-\text{Id}\|_{2}.$	(25)

From the definition of $\mathbf{C}$ , we have $\|\mathbf{C}\mathbf{C}^{\top}-\text{Id}\|_{2}=\|\mathbf{Z}^{-\frac{1}{2}}\hat{% \mathbf{Z}}\mathbf{Z}^{-\frac{1}{2}}-\mathbf{Id}\|_{2}\leq b$ , from the given statement in the Lemma. Note that if $\|\mathbf{C}\mathbf{C}^{\top}-\text{Id}\|_{2}\leq b$ for some $b\in(0,1/3)$ , then all eigenvalues of $\mathbf{C}\mathbf{C}^{\top}$ lie in the interval $[1-b,1+b]$ . This implies that all the eigenvalues of $(\mathbf{C}\mathbf{C}^{\top})^{-1}$ lie in the interval $[(1+b)^{-1},(1-b)^{-1}]$ . Hence, $\|(\mathbf{C}\mathbf{C}^{\top})^{-1}-\text{Id}\|_{2}\leq b/(1-b)$ . On combining this with Eqn. (25), we can conclude that if $\|\mathbf{C}\mathbf{C}^{\top}-\text{Id}\|_{2}\leq b$ , then $\|\mathbf{M}\|_{\text{op}}\leq\sqrt{2b/(1-b)}<1$ . On combining this with the previous claim that relates $\psi_{x}^{\top}\hat{\mathbf{Z}}^{-1}\psi_{x}$ to $\psi_{x}^{\top}{\mathbf{Z}}^{-1}\psi_{x}$ through $\|\mathbf{M}\|_{\text{op}}$ , we arrive at the result.
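The constants in this argument can be traced numerically. The sketch below evaluates the bound $\|\mathbf{M}\|_{\text{op}}\leq\sqrt{2b/(1-b)}$ and the resulting factor $(1-c)^{-1}$ ; with $b=1/9$ , the value obtained at the end of Appendix A.1, it gives $c=1/2$ and a factor of $2$ , which is the factor appearing in the second line of Eqn. (9). The function names are illustrative assumptions.

```python
import math

def op_norm_bound(b):
    # Bound on ||M||_op from Eqn. (25): sqrt(b / (1 - b) + b) <= sqrt(2b / (1 - b))
    return math.sqrt(2.0 * b / (1.0 - b))

def variance_inflation(b):
    # Factor (1 - c)^{-1} relating psi^T Zhat^{-1} psi to psi^T Z^{-1} psi,
    # with c taken to be the bound on ||M||_op
    return 1.0 / (1.0 - op_norm_bound(b))
```

The bound stays below $1$ exactly when $b<1/3$ , which is why the lemma restricts $b$ to $(0,1/3)$ .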

A.3 Proof of Lemma 3.4

Similar to the analysis in Appendix A.1, we fix an $R\in\mathcal{R}(n,\tau,\delta)$ and define projection matrices $\mathbf{P}$ and $\mathbf{Q}$ using the value of $R$ as defined in Appendix A.1. We define the projection of the kernel operator $k(\cdot,\cdot)$ on the subspaces spanned by $\mathbf{P}$ and $\mathbf{Q}$ as follows:

\displaystyle k^{(\mathbf{P})}(x,y)=\sum_{j=1}^{R}\lambda_{j}\varphi_{j}(x)% \varphi_{j}(y);\quad k^{(\mathbf{Q})}(x,y)=k(x,y)-k^{(\mathbf{P})}(x,y).

Recall that $\tilde{\gamma}_{X_{n},\tau}$ denotes the information gain corresponding to the randomly drawn set of points $X_{n}=\{x_{1},x_{2},\dots,x_{n}\}$ . Similar to $K_{X_{n},X_{n}}$ , we also define $K^{(\mathbf{P})}_{X_{n},X_{n}}$ and $K^{(\mathbf{Q})}_{X_{n},X_{n}}$ as $K^{(\mathbf{P})}_{X_{n},X_{n}}=[k^{(\mathbf{P})}(x_{i},x_{j})]_{i,j=1}^{n}$ and $K^{(\mathbf{Q})}_{X_{n},X_{n}}=[k^{(\mathbf{Q})}(x_{i},x_{j})]_{i,j=1}^{n}$ . It is straightforward to note that $K_{X_{n},X_{n}}=K^{(\mathbf{P})}_{X_{n},X_{n}}+K^{(\mathbf{Q})}_{X_{n},X_{n}}$ .

We first derive some auxiliary results on the spectrum of $K^{(\mathbf{P})}_{X_{n},X_{n}}$ and $K^{(\mathbf{Q})}_{X_{n},X_{n}}$ which will be useful in the analysis later. Recall that we defined $\boldsymbol{\Psi}_{n}:=[\psi_{x_{1}},\psi_{x_{2}},\dots,\psi_{x_{n}}]$ . We can also rewrite $K_{X_{n},X_{n}}$ , $K^{(\mathbf{P})}_{X_{n},X_{n}}$ and $K^{(\mathbf{Q})}_{X_{n},X_{n}}$ in terms of $\boldsymbol{\Psi}_{n}$ as: $K_{X_{n},X_{n}}=\boldsymbol{\Psi}_{n}^{\top}\boldsymbol{\Psi}_{n}$ , $K^{(\mathbf{P})}_{X_{n},X_{n}}=\boldsymbol{\Psi}_{n}^{\top}\mathbf{P}\boldsymbol{\Psi}_{n}$ and $K^{(\mathbf{Q})}_{X_{n},X_{n}}=\boldsymbol{\Psi}_{n}^{\top}\mathbf{Q}\boldsymbol{\Psi}_{n}$ . Using this relation, note that the nonzero singular values of $K^{(\mathbf{P})}_{X_{n},X_{n}}=(\mathbf{P}\boldsymbol{\Psi}_{n})^{\top}(\mathbf{P}\boldsymbol{\Psi}_{n})$ and $K^{(\mathbf{Q})}_{X_{n},X_{n}}=(\mathbf{Q}\boldsymbol{\Psi}_{n})^{\top}(\mathbf{Q}\boldsymbol{\Psi}_{n})$ are the same as those of $(\mathbf{P}\boldsymbol{\Psi}_{n})(\mathbf{P}\boldsymbol{\Psi}_{n})^{\top}=\mathbf{P}\boldsymbol{\Psi}_{n}\boldsymbol{\Psi}_{n}^{\top}\mathbf{P}$ and $(\mathbf{Q}\boldsymbol{\Psi}_{n})(\mathbf{Q}\boldsymbol{\Psi}_{n})^{\top}=\mathbf{Q}\boldsymbol{\Psi}_{n}\boldsymbol{\Psi}_{n}^{\top}\mathbf{Q}$ respectively.

For the spectrum of $K^{(\mathbf{P})}_{X_{n},X_{n}}$ , note that

	$\displaystyle K^{(\mathbf{P})}_{X_{n},X_{n}}$	$\displaystyle=(\mathbf{P}\boldsymbol{\Psi}_{n})^{\top}(\mathbf{P}\boldsymbol{% \Psi}_{n})=((n\boldsymbol{\Lambda})^{-1/2}\mathbf{P}\boldsymbol{\Psi}_{n})^{% \top}(n\boldsymbol{\Lambda})((n\boldsymbol{\Lambda})^{-1/2}\mathbf{P}% \boldsymbol{\Psi}_{n})$
		$\displaystyle=(\mathbf{P}(n\boldsymbol{\Lambda})^{-1/2}\boldsymbol{\Psi}_{n})^% {\top}\mathbf{P}(n\boldsymbol{\Lambda})\mathbf{P}(\mathbf{P}(n\boldsymbol{% \Lambda})^{-1/2}\boldsymbol{\Psi}_{n}).$

If $\tilde{\lambda}_{1}\geq\tilde{\lambda}_{2}\geq\dots\geq\tilde{\lambda}_{R}$ denote the eigenvalues of $K^{(\mathbf{P})}_{X_{n},X_{n}}$ , then using Ostrowski’s Theorem (Ostrowski, 1959), we can conclude that $\tilde{\lambda}_{j}=\theta_{j}n\lambda_{j}$ for all $j=1,2,\dots,R$ , where $\{n\lambda_{j}\}_{j=1}^{R}$ correspond to the eigenvalues of $n\mathbf{P}\boldsymbol{\Lambda}\mathbf{P}$ and each $\theta_{j}$ lies between the smallest and largest eigenvalues of the matrix $n^{-1}(\mathbf{P}\boldsymbol{\Lambda}^{-1/2}\boldsymbol{\Psi}_{n})^{\top}(\mathbf{P}\boldsymbol{\Lambda}^{-1/2}\boldsymbol{\Psi}_{n})$ . Note that the nonzero singular values (in this case, also eigenvalues) of $n^{-1}(\mathbf{P}\boldsymbol{\Lambda}^{-1/2}\boldsymbol{\Psi}_{n})^{\top}(\mathbf{P}\boldsymbol{\Lambda}^{-1/2}\boldsymbol{\Psi}_{n})$ are the same as those of $n^{-1}(\mathbf{P}\boldsymbol{\Lambda}^{-1/2}\boldsymbol{\Psi}_{n})(\mathbf{P}\boldsymbol{\Lambda}^{-1/2}\boldsymbol{\Psi}_{n})^{\top}=n^{-1}\sum_{i=1}^{n}(\mathbf{P}\omega_{i})(\mathbf{P}\omega_{i})^{\top}$ , where $\omega_{i}=\boldsymbol{\Lambda}^{-1/2}\psi_{x_{i}}$ , as defined in Appendix A.1. Using Eqn. (12) together with $R\in\mathcal{R}(n,\tau,\delta)$ and $n\geq\overline{N}$ , we can conclude that the following relation is true with probability at least $1-\delta/6$ :

\displaystyle\left\|\left(\frac{1}{n}\sum_{i=1}^{n}(\mathbf{P}\omega_{i})(% \mathbf{P}\omega_{i})^{\top}-\mathbf{P}\right)\right\|_{2}\leq\frac{1}{27}.

Thus, we can conclude that eigenvalues of $n^{-1}(\mathbf{P}\boldsymbol{\Lambda}^{-1/2}\boldsymbol{\Psi}_{n})^{\top}(% \mathbf{P}\boldsymbol{\Lambda}^{-1/2}\boldsymbol{\Psi}_{n})$ lie in the range $[26/27,28/27]$ and consequently, $\tilde{\lambda}_{j}\geq 26n\lambda_{j}/27$ .

As mentioned earlier, the singular values of $K^{(\mathbf{Q})}_{X_{n},X_{n}}$ are the same as those of $\mathbf{Q}\boldsymbol{\Psi}_{n}\boldsymbol{\Psi}_{n}^{\top}\mathbf{Q}$ . For the analysis, it suffices to have an upper bound on $\|K^{(\mathbf{Q})}_{X_{n},X_{n}}\|_{2}$ , or equivalently, $\|\mathbf{Q}\boldsymbol{\Psi}_{n}\boldsymbol{\Psi}_{n}^{\top}\mathbf{Q}\|_{2}$ . Using the result from Moeller and Ullrich (2021, Proposition 3.8), we know that the following relation holds with probability $1-\delta/6$ :

\displaystyle\|\mathbf{Q}\boldsymbol{\Psi}_{n}\boldsymbol{\Psi}_{n}^{\top}\mathbf{Q}\|_{2}\leq 2\max\left\{42\log\left(\frac{12}{\delta}\right)T(R),n\lambda_{R+1}\right\}.

Since $R\in\mathcal{R}_{n,\tau}$ , we can conclude that $\|K^{(\mathbf{Q})}_{X_{n},X_{n}}\|_{2}=\|\mathbf{Q}\boldsymbol{\Psi}_{n}% \boldsymbol{\Psi}_{n}^{\top}\mathbf{Q}\|_{2}\leq 2\tau/27$ . We are now ready to prove the lemma.

Using the relation $K_{X_{n},X_{n}}=K^{(\mathbf{P})}_{X_{n},X_{n}}+K^{(\mathbf{Q})}_{X_{n},X_{n}}$ , we can decompose the information gain of $X_{n}$ as follows:

	$\displaystyle\tilde{\gamma}_{X_{n},\tau}$	$\displaystyle=\frac{1}{2}\log\left(\det(I_{n}+\tau^{-1}K_{X_{n},X_{n}})\right)$
		$\displaystyle=\frac{1}{2}\log\left(\det(I_{n}+\tau^{-1}K^{(\mathbf{P})}_{X_{n}% ,X_{n}}+\tau^{-1}K^{(\mathbf{Q})}_{X_{n},X_{n}})\right)$
		$\displaystyle=\frac{1}{2}\log\left(\det((I_{n}+\tau^{-1}K^{(\mathbf{Q})}_{X_{n% },X_{n}})(I_{n}+\tau^{-1}(I_{n}+\tau^{-1}K^{(\mathbf{Q})}_{X_{n},X_{n}})^{-1}K% ^{(\mathbf{P})}_{X_{n},X_{n}}))\right)$
		$\displaystyle=\frac{1}{2}\underbrace{\log(\det(I+\tau^{-1}K^{(\mathbf{Q})}_{X_% {n},X_{n}}))}_{:=G_{1}}+\frac{1}{2}\underbrace{\log(\det(I+\tau^{-1}(I+\tau^{-% 1}K^{(\mathbf{Q})}_{X_{n},X_{n}})^{-1}K^{(\mathbf{P})}_{X_{n},X_{n}}))}_{:=G_{% 2}}.$

This decomposition is similar to that derived in Vakili et al. (2021b, App. A, Eqn. 8) with the roles of $K^{(\mathbf{P})}_{X_{n},X_{n}}$ and $K^{(\mathbf{Q})}_{X_{n},X_{n}}$ interchanged.
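The factorization used in the third line is the identity $\det(I+\tau^{-1}(A+B))=\det(I+\tau^{-1}B)\cdot\det(I+\tau^{-1}(I+\tau^{-1}B)^{-1}A)$ . A minimal sketch that checks it on $2\times 2$ matrices is given below; the helper functions and the example matrices standing in for $K^{(\mathbf{P})}_{X_{n},X_{n}}$ and $K^{(\mathbf{Q})}_{X_{n},X_{n}}$ are hypothetical and chosen only for illustration.

```python
def det2(m):
    # Determinant of a 2x2 matrix given as nested lists
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def inv2(m):
    # Inverse of a 2x2 matrix via the adjugate formula
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]

def matmul2(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def add2(a, b):
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]

def scale2(s, a):
    return [[s * a[i][j] for j in range(2)] for i in range(2)]

I2 = [[1.0, 0.0], [0.0, 1.0]]

def both_sides(K_P, K_Q, tau):
    # Left side: det(I + K / tau) with K = K_P + K_Q
    lhs = det2(add2(I2, scale2(1.0 / tau, add2(K_P, K_Q))))
    # Right side: det(I + K_Q / tau) * det(I + (I + K_Q / tau)^{-1} K_P / tau)
    B = add2(I2, scale2(1.0 / tau, K_Q))
    inner = add2(I2, scale2(1.0 / tau, matmul2(inv2(B), K_P)))
    return lhs, det2(B) * det2(inner)
```

Taking logarithms of the two factors on the right side gives the terms $G_{1}$ and $G_{2}$ .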

We begin with $G_{1}$ . Since $\|K^{(\mathbf{Q})}_{X_{n},X_{n}}\|_{2}\leq 2\tau/27$ , all eigenvalues of $\tau^{-1}K^{(\mathbf{Q})}_{X_{n},X_{n}}$ are less than $1$ . Using the relation $\log(1+x)\geq x/2$ , which holds for all $x\in[0,1]$ , we can lower bound $G_{1}$ as follows:

\displaystyle G_{1}=\log(\det(I+\tau^{-1}K^{(\mathbf{Q})}_{X_{n},X_{n}}))\geq% \frac{1}{2\tau}\mathrm{trace}(K^{(\mathbf{Q})}_{X_{n},X_{n}}).

Note that $k^{(\mathbf{Q})}(x_{i},x_{i})$ , $i=1,2,\dots,n$ , are i.i.d. random variables with $\mathbb{E}[k^{(\mathbf{Q})}(x_{i},x_{i})]=\sum_{r=R+1}^{\infty}\lambda_{r}$ and $|k^{(\mathbf{Q})}(x_{i},x_{i})|\leq T(R)$ . We can thus use Hoeffding’s inequality to obtain the following bound on $\mathrm{trace}(K^{(\mathbf{Q})}_{X_{n},X_{n}})$ , which holds with probability at least $1-\delta/6$ :

	$\displaystyle G_{1}$	$\displaystyle\geq\frac{1}{2\tau}\mathrm{trace}(K^{(\mathbf{Q})}_{X_{n},X_{n}})$
		$\displaystyle\geq\frac{1}{2\tau}\left[n\sum_{r=R+1}^{\infty}\lambda_{r}-T(R)\sqrt{n\log(12/\delta)}\right]$
		$\displaystyle\geq\frac{nT(R)}{2\tau F^{2}}\left(1-F^{2}\sqrt{\frac{\log(12/\delta)}{n}}\right)$
		$\displaystyle\geq\frac{13nT(R)}{27\tau F^{2}}.$

In the third line, we used the fact that $T(R)\leq F^{2}\sum_{r=R+1}^{\infty}\lambda_{r}$ since $\|\varphi_{j}\|_{\infty}\leq F$ for all $j\in\mathbb{N}$ (Assumption 2.3). The fourth line uses the condition that $n\geq\overline{N}$ .

To bound $G_{2}$ , first note that using the condition on the spectrum of $\tau^{-1}K^{(\mathbf{Q})}_{X_{n},X_{n}}$ , we can conclude that all the eigenvalues of $(I+\tau^{-1}K^{(\mathbf{Q})}_{X_{n},X_{n}})$ lie in the range $[1,2]$ . Moreover, note that the spectrum of $(I+\tau^{-1}K^{(\mathbf{Q})}_{X_{n},X_{n}})^{-1}K^{(\mathbf{P})}_{X_{n},X_{n}}$ is the same as that of $(I+\tau^{-1}K^{(\mathbf{Q})}_{X_{n},X_{n}})^{-1/2}K^{(\mathbf{P})}_{X_{n},X_{n}}(I+\tau^{-1}K^{(\mathbf{Q})}_{X_{n},X_{n}})^{-1/2}$ . On using Ostrowski’s Theorem (Ostrowski, 1959) along with the range of eigenvalues of $(I+\tau^{-1}K^{(\mathbf{Q})}_{X_{n},X_{n}})$ , we can conclude that

\displaystyle G_{2}=\log(\det(I+\tau^{-1}(I+\tau^{-1}K^{(\mathbf{Q})}_{X_{n},X% _{n}})^{-1}K^{(\mathbf{P})}_{X_{n},X_{n}}))\geq\log(\det(I+(2\tau)^{-1}K^{(% \mathbf{P})}_{X_{n},X_{n}})).

Using the relation for the eigenvalues of $K^{(\mathbf{P})}_{X_{n},X_{n}}$ derived earlier, we can further bound $G_{2}$ as follows:

	$\displaystyle G_{2}$	$\displaystyle\geq\log(\det(I+(2\tau)^{-1}K^{(\mathbf{P})}_{X_{n},X_{n}}))$
		$\displaystyle\geq\sum_{j=1}^{R}\log(1+(2\tau)^{-1}\tilde{\lambda}_{j})$
		$\displaystyle\geq\sum_{j=1}^{R}\log\left(1+\frac{13n\lambda_{j}}{27\tau}\right)$
		$\displaystyle\geq\sum_{j=1}^{R}\frac{13n\lambda_{j}}{13n\lambda_{j}+27\tau}$
		$\displaystyle\geq\frac{13n}{27F^{2}}\sup_{x\in\mathcal{X}}\sum_{j=1}^{R}\frac{% \lambda_{j}}{n\lambda_{j}+\tau}\varphi_{j}^{2}(x).$

In the fourth line, we used the relation $\log(1+x)\geq\frac{x}{x+1}$ , which holds for all $x\geq 0$ .
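Both scalar inequalities used for $G_{1}$ and $G_{2}$ , namely $\log(1+x)\geq x/2$ on $[0,1]$ and $\log(1+x)\geq x/(1+x)$ for $x\geq 0$ , can be spot-checked on a grid. The sketch below is only a numerical illustration and not a proof; the grid sizes and the cut-off `x_max = 100.0` are arbitrary assumptions.

```python
import math

def min_gap_g1(num_points=1001):
    # Smallest value of log(1 + x) - x / 2 over an even grid on [0, 1]
    xs = [i / (num_points - 1) for i in range(num_points)]
    return min(math.log1p(x) - x / 2.0 for x in xs)

def min_gap_g2(x_max=100.0, num_points=1001):
    # Smallest value of log(1 + x) - x / (1 + x) over an even grid on [0, x_max]
    xs = [x_max * i / (num_points - 1) for i in range(num_points)]
    return min(math.log1p(x) - x / (1.0 + x) for x in xs)
```

Both gaps are non-negative on the grid and vanish only at $x=0$ , where the two sides of each inequality coincide.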

On combining the bounds for $G_{1}$ and $G_{2}$ , we obtain,

	$\displaystyle\tilde{\gamma}_{X_{n},\tau}$	$\displaystyle=\frac{1}{2}(G_{1}+G_{2})$
		$\displaystyle\geq\frac{13nT(R)}{54\tau F^{2}}+\frac{13n}{54F^{2}}\sup_{x\in% \mathcal{X}}\sum_{j=1}^{R}\frac{\lambda_{j}}{n\lambda_{j}+\tau}\varphi_{j}^{2}% (x)$
		$\displaystyle\geq\frac{13n}{54F^{2}}\left(\sup_{x\in\mathcal{X}}\sum_{j=1}^{R}\frac{\lambda_{j}}{n\lambda_{j}+\tau}\varphi_{j}^{2}(x)+\frac{T(R)}{\tau}\right)$
		$\displaystyle\geq\frac{13n}{54F^{2}}\sup_{x\in\mathcal{X}}\left(\sum_{j=1}^{R}% \frac{\lambda_{j}}{n\lambda_{j}+\tau}\varphi_{j}^{2}(x)+\sum_{j=R+1}^{\infty}% \frac{\lambda_{j}}{n\lambda_{j}+\tau}\varphi_{j}^{2}(x)\right)$
		$\displaystyle\geq\frac{13n}{54F^{2}}\sup_{x\in\mathcal{X}}\psi_{x}^{\top}% \mathbf{Z}^{-1}\psi_{x},$

as required. Since each of the bounds on $G_{1}$ and on the eigenvalues of $K^{(\mathbf{P})}_{X_{n},X_{n}}$ and $K^{(\mathbf{Q})}_{X_{n},X_{n}}$ holds with probability at least $1-\delta/6$ , a union bound shows that the overall bound holds with probability at least $1-\delta/2$ .

Appendix B Proof of Theorems 4.3 and 4.5

The proofs of both theorems follow along the lines of the analysis of the Batched Pure Exploration (BPE) algorithm Li and Scarlett (2022). We begin with a brief discussion of Assumption 4 and then move on to the proofs.

Definition B.1.

Let $\Gamma:\mathcal{X}\to\mathcal{X}^{\prime}$ be a map between two sets $\mathcal{X},\mathcal{X}^{\prime}\subset\mathbb{R}^{d}$ . We call $\Gamma$ a bi-Lipschitz map if the inverse map, $\Gamma^{-1}$ , exists and the following relations hold for some $L,L^{\prime}>0$ :

	$\displaystyle\|\Gamma(x)-\Gamma(y)\|_{2}$	$\displaystyle\leq L\|x-y\|_{2}\ \ \forall x,y\in\mathcal{X}$
	$\displaystyle\|\Gamma^{-1}(x)-\Gamma^{-1}(y)\|_{2}$	$\displaystyle\leq L^{\prime}\|x-y\|_{2}\ \ \forall x,y\in\mathcal{X}^{\prime}.$

We refer to $(L,L^{\prime})$ as the Lipschitz constant pair of $\Gamma$ . We also define the normalized Lipschitz constant pair of $\Gamma$ to be the pair $(\tilde{L},\tilde{L}^{\prime})=\left(L\left(\frac{\mathrm{vol}(\mathcal{X})}{\mathrm{vol}(\mathcal{X}^{\prime})}\right)^{1/d},L^{\prime}\left(\frac{\mathrm{vol}(\mathcal{X}^{\prime})}{\mathrm{vol}(\mathcal{X})}\right)^{1/d}\right)$ .
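As a small illustration of this definition, the sketch below evaluates the normalized pair for a pure scaling map $\Gamma(x)=ax$ from $[0,1]^{d}$ onto $[0,a]^{d}$ . The function name, its arguments and the choice of map are assumptions made for this example only. For such a map $L=a$ and $L^{\prime}=1/a$ , and both normalized constants come out as $1$ up to floating point rounding.

```python
def normalized_lipschitz_pair(L, L_inv, vol_X, vol_Xp, d):
    """Return the normalized pair (L~, L~') from Definition B.1.

    L, L_inv      : Lipschitz constants of Gamma and of Gamma^{-1}
    vol_X, vol_Xp : volumes of the source set X and the image set X'
    d             : ambient dimension
    """
    L_tilde = L * (vol_X / vol_Xp) ** (1.0 / d)
    L_inv_tilde = L_inv * (vol_Xp / vol_X) ** (1.0 / d)
    return L_tilde, L_inv_tilde


# Hypothetical example: Gamma(x) = a * x maps [0, 1]^d onto [0, a]^d,
# so L = a, L' = 1 / a, vol(X) = 1 and vol(X') = a^d.
a, d = 0.25, 3
pair = normalized_lipschitz_pair(L=a, L_inv=1.0 / a, vol_X=1.0, vol_Xp=a ** d, d=d)
print(pair)
```

A map that only rescales the domain therefore has the same normalized pair as the identity map.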

The normalized Lipschitz constant pair quantifies solely the change due to structure and discounts for the change in size between $\mathcal{X}$ and $\mathcal{X}^{\prime}$ . The following is a restatement of Assumption 4.

Assumption B.2.

Let $\mathcal{L}_{\eta}=\{x\in\mathcal{X}|f(x)\geq\eta\}$ denote the level set of $f$ for $\eta\in[-B,B]$ . Then,

•

For all $\eta\in[-B,B]$ , $\mathcal{L}_{\eta}$ is a disjoint union of at most $M_{f}<\infty$ closed, path connected components.
•

For a given $\eta\in[-B,B]$ , let $\mathcal{L}_{\eta}^{i}$ denote the $i^{\text{th}}$ such connected component of $\mathcal{L}_{\eta}$ . We assume that there exists a bi-Lipschitzian map $\Gamma_{\eta,i}:\mathcal{X}\to\mathcal{L}_{\eta}^{i}$ with normalized Lipschitz constant pair $\tilde{L}_{\eta,i},\tilde{L}_{\eta,i}^{\prime}>0$ for all $\eta,i$ . Let $L_{f}=\sup_{\eta,i}\tilde{L}_{\eta,i}$ and $L_{f}^{\prime}=\sup_{\eta,i}\tilde{L}_{\eta,i}^{\prime}$ . We assume that $L_{f},L_{f}^{\prime}<\infty$ .

Assumption 4 is an assumption on the regularity of the level sets of the function $f$ . The term $M_{f}$ can be thought of as the number of local maxima of $f$ , and hence finiteness of $M_{f}$ is a mild assumption on $f$ that is satisfied by functions encountered in practice. Moreover, the knowledge of $M_{f}$ is only required for the analysis and not for the algorithm to run. The second condition on $f$ ensures that these connected components are topologically regular enough and rules out certain pathological cases. In particular, the existence of a bi-Lipschitzian map between two sets implies topological similarity between the two sets. Intuitively, this assumption ensures that the shape of the level sets is not “too arbitrary”. Note that such an assumption on the level sets of $f$ is relatively mild, as the RKHS endows the function $f$ with smoothness properties, which translate to a degree of topological regularity of the level sets Alberti et al. (2011); Lee (2010).

B.1 Proof of Theorem 4.3

At a high level, the bound on regret is obtained by first separately bounding the regret during every epoch $r$ and then summing it across all epochs. During any epoch $r$ , since REDS chooses points uniformly at random from the current domain $\mathcal{X}_{r}$ , we simply bound the regret incurred at each point queried during this epoch by the worst case scenario, i.e., $\varsigma_{r}:=f(x^{*})-\inf_{x\in\mathcal{X}_{r}}f(x)$ . This leads to an upper bound of $\varsigma_{r}N_{r}M_{f}$ on the regret incurred during epoch $r$ , as there are at most $M_{f}$ connected components in each level set. Since poorly performing regions of the domain are eliminated as the algorithm proceeds, $\inf_{x\in\mathcal{X}_{r}}f(x)$ gets closer to $f(x^{*})$ , reducing the regret in each epoch as the algorithm proceeds.

The following two lemmas ensure the correctness of the algorithm and help bound the regret incurred during each epoch.

Lemma B.3.

$x^{*}\in\mathcal{X}_{r}$ for all $r\geq 1$ .

Lemma B.4.

For all epochs $r$ , we have,

\displaystyle\varsigma_{r}\leq\begin{cases}2B&\text{ if }r=1,\\ 4B\sup_{x\in\mathcal{X}_{r-1}}\sigma_{r-1}(x)&\text{ if }r\geq 2.\end{cases}

We defer the proof of these lemmas to Appendix B.3. Equipped with these lemmas, we move on to the proof of Theorem 4.3. The regret incurred by REDS can be bounded as

	$\displaystyle R(T)$	$\displaystyle=\sum_{t=1}^{T}f(x^{*})-f(x_{t})\leq\sum_{r=1}^{S}\varsigma_{r}N_% {r}M_{f}$
		$\displaystyle\leq 2BN_{1}+4BM_{f}\sum_{r=2}^{S}\left[N_{r}\cdot\sup_{x\in% \mathcal{X}_{r-1}}\sigma_{r-1}(x)\right].$

In the above expression, $S$ denotes the total number of epochs that begin during a run of the REDS algorithm before reaching a total of $T$ queries. Since the epoch lengths double every epoch, we have $S\leq 1+\log_{2}(T/N_{1})$ . We can further bound $R(T)$ by using Lemma 4.6 (which in turn is based on Theorem 3.1) to bound the worst-case posterior standard deviation in the above equation. Since $\mathcal{X}_{r-1}$ is compact ( $\mathcal{X}_{r-1}$ is closed by definition and $\mathcal{X}_{r-1}$ is bounded because $\mathcal{X}_{r-1}\subseteq\mathcal{X}$ ) and $N_{r-1}\geq N_{1}\geq C_{L_{f},L_{f}^{\prime}}\overline{N}$ , we can invoke Lemma 4.6 to conclude

\displaystyle R(T)\leq 2BN_{1}+4BC_{2}C_{L_{f},L_{f}^{\prime}}^{\prime}M_{f}\sum_{r=2}^{S}N_{r}\cdot N_{r-1}^{(1-\beta)/2}(\log(N_{r-1}/\delta^{\prime}))^{\beta/2},	(26)

where $\delta^{\prime}=\delta/\log_{2}T$ , $C_{2}=\sqrt{C_{1}}$ and $C_{L_{f},L_{f}^{\prime}},C_{L_{f},L_{f}^{\prime}}^{\prime}$ are the constants from Lemma 4.6 and depend only on $L_{f},L_{f}^{\prime}$ . For simplicity, we define $C_{f}:=C_{L_{f},L_{f}^{\prime}}^{\prime}M_{f}$ , as a constant that depends only on the function $f$ . On plugging in the values of $N_{r}$ , Eqn. (26) simplifies to

$\displaystyle R(T)$	$\displaystyle\leq 2BN_{1}+4BC_{2}C_{f}\sum_{r=2}^{S}N_{r}\cdot N_{r-1}^{(1-\beta)/2}(\log(N_{r-1}/\delta^{\prime}))^{\beta/2}$
	$\displaystyle\leq 2BN_{1}+4BC_{2}C_{f}N_{1}^{(3-\beta)/2}\sum_{r=2}^{S}2^{r-1}% \cdot 2^{(r-2)(1-\beta)/2}\left(\log\left(\frac{N_{1}}{\delta^{\prime}}\cdot 2% ^{r-2}\right)\right)^{\beta/2}$
	$\displaystyle\leq 2BN_{1}+8BC_{2}C_{f}N_{1}^{(3-\beta)/2}\sum_{r=0}^{S-2}2^{r(% 3-\beta)/2}\left(\log\left(\frac{N_{1}}{\delta^{\prime}}\cdot 2^{r}\right)% \right)^{\beta/2}.$	(27)
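The substitution above uses epoch lengths that double, i.e., $N_{r}=2^{r-1}N_{1}$ . The following sketch lists the epochs that begin within a budget of $T$ queries under that schedule. The helper name `epoch_lengths`, the truncation of the final epoch and the values of $T$ and $N_{1}$ are assumptions made for illustration and are not taken from the description of REDS.

```python
import math


def epoch_lengths(T, N1):
    """Lengths of the epochs that begin within T queries when N_r = 2^(r-1) * N1.

    The final epoch is truncated so that the lengths sum exactly to T.
    """
    lengths, used, r = [], 0, 1
    while used < T:
        n_r = min(N1 * 2 ** (r - 1), T - used)
        lengths.append(n_r)
        used += n_r
        r += 1
    return lengths


# Hypothetical budget and initial epoch length.
T, N1 = 10_000, 50
lengths = epoch_lengths(T, N1)
S = len(lengths)
print(S, lengths, round(math.log2(T / N1), 2))
```

The number of epochs grows only logarithmically in $T/N_{1}$ , which is why the sums over $r$ in the cases below contribute at most logarithmic factors.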

We consider three separate cases based on the value of $\beta$ :

•

$\beta<3$ : Under this case, Eqn. (27) can be simplified as follows:

	$\displaystyle R(T)$	$\displaystyle\leq 2BN_{1}+8BC_{2}C_{f}N_{1}^{(3-\beta)/2}\sum_{r=0}^{S-2}2^{r(% 3-\beta)/2}\left(\log\left(\frac{N_{1}}{\delta^{\prime}}\cdot 2^{r}\right)% \right)^{\beta/2}$
		$\displaystyle\leq 2BN_{1}+8BC_{2}C_{f}N_{1}^{(3-\beta)/2}\left(\log\left(\frac% {T}{\delta^{\prime}}\right)\right)^{3/2}\sum_{r=0}^{S-2}2^{r(3-\beta)/2}$
		$\displaystyle\leq 2BN_{1}+8BC_{2}C_{f}N_{1}^{(3-\beta)/2}\left(\log\left(\frac% {T}{\delta^{\prime}}\right)\right)^{3/2}\frac{2^{(S-1)(3-\beta)/2}-1}{2^{(3-% \beta)/2}-1}$
		$\displaystyle\leq 2BN_{1}+\frac{8BC_{2}C_{f}}{2^{(3-\beta)/2}-1}T^{(3-\beta)/2% }\left(\log\left(\frac{T}{\delta^{\prime}}\right)\right)^{3/2}.$

•

$\beta=3$ : For this value of $\beta$ , Eqn. (27) can be simplified as follows:

	$\displaystyle R(T)$	$\displaystyle\leq 2BN_{1}+8BC_{2}C_{f}N_{1}^{(3-\beta)/2}\sum_{r=0}^{S-2}2^{r(% 3-\beta)/2}\left(\log\left(\frac{N_{1}}{\delta^{\prime}}\cdot 2^{r}\right)% \right)^{\beta/2}$
		$\displaystyle\leq 2BN_{1}+8BC_{2}C_{f}\cdot\left(\log\left(\frac{T}{\delta^{% \prime}}\right)\right)^{3/2}\cdot\sum_{r=0}^{S-2}1$
		$\displaystyle\leq 2BN_{1}+8BC_{2}C_{f}\cdot\left(\log\left(\frac{T}{\delta^{% \prime}}\right)\right)^{3/2}\cdot\log\left(\frac{T}{N_{1}}\right).$

•

$\beta>3$ : For this range, we have,

	$\displaystyle R(T)$	$\displaystyle\leq 2BN_{1}+8BC_{2}C_{f}N_{1}^{(3-\beta)/2}\sum_{r=0}^{S-2}2^{r(% 3-\beta)/2}\left(\log\left(\frac{N_{1}}{\delta^{\prime}}\cdot 2^{r}\right)% \right)^{\beta/2}$
		$\displaystyle\leq 2BN_{1}+8BC_{2}C_{f}\cdot\left(\log\left(\frac{T}{\delta^{% \prime}}\right)\right)^{3/2}\cdot\sum_{r=0}^{S-2}2^{r(3-\beta)/4}\left[\frac{% \log(N_{1}\cdot 2^{r})+\log(1/\delta^{\prime})}{N_{1}\cdot 2^{r/2}}\right]^{(% \beta-3)/2}$
		$\displaystyle\leq 2BN_{1}+8BC_{2}C_{f}\cdot\left(\log\left(\frac{T}{\delta^{% \prime}}\right)\right)^{3/2}\cdot\left[\frac{\log(N_{1}/\delta^{\prime})}{N_{1% }}\right]^{(\beta-3)/2}\cdot\sum_{r=0}^{S-2}2^{r(3-\beta)/4}$
		$\displaystyle\leq 2BN_{1}+8BC_{2}C_{f}\cdot\left(\log\left(\frac{T}{\delta^{% \prime}}\right)\right)^{3/2}\cdot\left[\frac{\log(N_{1}/\delta^{\prime})}{N_{1% }}\right]^{(\beta-3)/2}\cdot\sum_{r=0}^{\infty}2^{r(3-\beta)/4}$
		$\displaystyle\leq 2BN_{1}+\frac{8BC_{2}C_{f}}{1-2^{(3-\beta)/4}}\cdot\left(% \log\left(\frac{T}{\delta^{\prime}}\right)\right)^{3/2}.$

In the third step, we used the fact that $\dfrac{\log(N_{1}\cdot 2^{r}/\delta^{\prime})}{N_{1}\cdot 2^{r/2}}$ is a decreasing function of $r$ for all $r\geq 0$ and in the fifth step we used the fact that $N_{1}\geq\log(N_{1}/\delta^{\prime})$ since $N_{1}\geq\overline{N}(\delta^{\prime})$ .

On combining all the cases, we arrive at the result. The statement in Corollary 4.4 follows immediately from the above proof by plugging in $\beta=1+2\nu/d$ .
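The three cases can be summarized by the polynomial exponent of $T$ and by whether an additional $\log(T/N_{1})$ factor appears. The sketch below reads these off from the final line of each case, omitting the additive $2BN_{1}$ term and the common $(\log(T/\delta^{\prime}))^{3/2}$ factor. The function names are hypothetical and the values of $\nu$ and $d$ are arbitrary examples.

```python
def regret_rate(beta):
    """Return (exponent of T, extra log(T/N1) factor present?) for a given beta.

    Read off from the three cases; the common (log(T/delta'))^{3/2} factor
    and the additive 2*B*N1 term are omitted.
    """
    if beta < 3:
        return (3 - beta) / 2, False
    if beta == 3:
        return 0.0, True
    return 0.0, False


def beta_from_smoothness(nu, d):
    """Substitution beta = 1 + 2*nu/d used for Corollary 4.4."""
    return 1 + 2 * nu / d


# With beta = 1 + 2*nu/d the exponent (3 - beta)/2 equals 1 - nu/d when nu < d.
for nu, d in [(1, 2), (2, 2), (5, 2)]:
    print(nu, d, regret_rate(beta_from_smoothness(nu, d)))
```

Under this substitution, the condition $\beta<3$ is the same as $\nu<d$ , and the regret is polylogarithmic in $T$ once $\nu\geq d$ .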

B.2 Proof of Theorem 4.5

The proof of Theorem 4.5 is almost identical to that of Theorem 4.3. The following lemma is a counterpart to Lemma B.4 for the noisy case.

Lemma B.5.

For all epochs $r$ , the following relation holds with probability at least $1-\delta/2$ :

\displaystyle\varsigma_{r}\leq\begin{cases}2B&\text{ if }r=1,\\ 4\alpha_{\tau}(\delta^{\prime}/2)\left[\sup_{x\in\mathcal{X}_{r-1}}\sigma_{r-1% ,\tau}(x)\right]+\frac{2B}{T}+R\sqrt{\frac{2}{T\tau}\log\left(\frac{4T}{\delta% ^{\prime}}\right)}&\text{ if }r\geq 2.\end{cases}

The proof of this lemma is identical to that of Lemma B.4 with the definitions of $\mathrm{UCB}$ and $\mathrm{LCB}$ changed according to the noisy setup (see Vakili et al. (2021a) for an exact derivation). On using Lemma 4.6 (for the noisy case) along with Lemma B.5, we can rewrite Eqn. (26) as

	$\displaystyle R(T)$	$\displaystyle\leq 2BN_{1}+M_{f}\sum_{r=2}^{S}N_{r}\cdot\left[4\sqrt{C_{\tau}}C% _{L_{f},L_{f}^{\prime}}^{\prime}\alpha_{\tau}(\delta^{\prime}/2)\sqrt{\frac{% \gamma_{N_{r-1},\tau}}{N_{r-1}}}+\frac{2B}{T}+R\sqrt{\frac{2}{T\tau}\log\left(% \frac{4T}{\delta^{\prime}}\right)}\right]$
		$\displaystyle\leq 2BN_{1}+M_{f}\sum_{r=2}^{S}N_{r}\cdot\left[4\sqrt{C_{\tau}}C% _{L_{f},L_{f}^{\prime}}^{\prime}\alpha_{\tau}(\delta^{\prime}/2)\sqrt{\frac{% \gamma_{T,\tau}}{N_{r-1}}}+\frac{2B}{T}+R\sqrt{\frac{2}{T\tau}\log\left(\frac{% 4T}{\delta^{\prime}}\right)}\right],$		(28)

where the second line follows from the monotonicity of $\gamma_{n,\tau}$ , i.e., $\gamma_{n_{1},\tau}\leq\gamma_{n_{2},\tau}$ for all $n_{1}\leq n_{2}$ , and $C_{\tau}$ is the leading constant in Eqn. (9). On plugging in the values of $N_{r}$ in Eqn. (28), we obtain,

	$\displaystyle R(T)$	$\displaystyle\leq 2BN_{1}+\sum_{r=2}^{S}N_{r}\cdot\left[4\sqrt{C_{\tau}}C_{L_{% f},L_{f}^{\prime}}^{\prime}M_{f}\alpha_{\tau}(\delta^{\prime}/2)\sqrt{\frac{% \gamma_{T,\tau}}{N_{r-1}}}+\frac{2BM_{f}}{T}+RM_{f}\sqrt{\frac{2}{T\tau}\log% \left(\frac{4T}{\delta^{\prime}}\right)}\right]$
		$\displaystyle\leq 2BN_{1}+\sum_{r=2}^{S}\left[4\sqrt{N_{1}C_{\tau}}C_{f}\alpha% _{\tau}(\delta^{\prime}/2)\sqrt{\gamma_{T,\tau}}\cdot 2^{r-1}\cdot 2^{-(r-2)/2% }+M_{f}\cdot\frac{2BN_{1}}{T}\cdot 2^{r-1}+M_{f}\cdot RN_{1}\sqrt{\frac{2}{T% \tau}\log\left(\frac{4T}{\delta^{\prime}}\right)}\cdot 2^{r-1}\right]$
		$\displaystyle\leq 2BN_{1}+8\sqrt{N_{1}C_{\tau}}C_{f}\alpha_{\tau}(\delta^{% \prime}/2)\sqrt{\gamma_{T,\tau}}\left(\sum_{r=0}^{S-2}2^{r/2}\right)+M_{f}% \cdot\left(\frac{4B}{T}+2R\sqrt{\frac{2}{T\tau}\log\left(\frac{4T}{\delta^{% \prime}}\right)}\right)N_{1}\left(\sum_{r=0}^{S-2}2^{r}\right)$
		$\displaystyle\leq 2BN_{1}+\frac{8}{\sqrt{2}-1}\sqrt{N_{1}C_{\tau}}C_{f}\alpha_% {\tau}(\delta^{\prime}/2)\sqrt{\gamma_{T,\tau}}\cdot\sqrt{\frac{T}{N_{1}}}+M_{% f}\cdot\left(\frac{4B}{T}+2R\sqrt{\frac{2}{T\tau}\log\left(\frac{4T}{\delta^{% \prime}}\right)}\right)\cdot N_{1}\cdot\frac{T}{N_{1}}$
		$\displaystyle\leq 2BN_{1}+\frac{8}{\sqrt{2}-1}\sqrt{C_{\tau}}C_{f}\alpha_{\tau% }(\delta^{\prime}/2)\sqrt{T\gamma_{T,\tau}}+4BM_{f}+2RM_{f}\sqrt{\frac{2T}{% \tau}\log\left(\frac{4T}{\delta^{\prime}}\right)},$

where $C_{f}=C_{L_{f},L_{f}^{\prime}}^{\prime}M_{f}$ as before. Hence, $R(T)$ is $\tilde{\mathcal{O}}(\sqrt{T\gamma_{T,\tau}})$ , as required.

B.3 Proof of Auxiliary Lemmas

B.3.1 Proof of Lemma B.3

The main ingredient in the proof is the relation $|f(x)-\mu_{r-1}(x)|\leq B\sigma_{r-1}(x)$ , which holds for all $x\in\mathcal{X}_{r-1}$ and across all epochs $r$ . This is a well-known relation in the literature Vakili et al. (2021a); Lyu et al. (2020) that bounds the prediction error of the posterior mean in terms of the posterior standard deviation.

We use induction to prove the lemma. Since $\mathcal{X}_{1}=\mathcal{X}$ and $x^{*}\in\mathcal{X}$ holds by definition, $x^{*}\in\mathcal{X}_{1}$ . Assume that $x^{*}\in\mathcal{X}_{r-1}$ . Using the relation $|f(x)-\mu_{r-1}(x)|\leq B\sigma_{r-1}(x)$ , we can conclude,

\displaystyle\sup_{x^{\prime}\in\mathcal{X}_{r-1}}\mathrm{LCB}_{r-1}(x^{\prime% })=\sup_{x^{\prime}\in\mathcal{X}_{r-1}}(\mu_{r-1}(x^{\prime})-B\sigma_{r-1}(x% ^{\prime}))\leq\sup_{x^{\prime}\in\mathcal{X}_{r-1}}f(x^{\prime})=f(x^{*})\leq% \mathrm{UCB}_{r-1}(x^{*}),

where we used the inductive hypothesis to establish $\sup_{x^{\prime}\in\mathcal{X}_{r-1}}f(x^{\prime})=f(x^{*})$ . This implies that $x^{*}\in\mathcal{X}_{r}$ , as required.
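The argument above can be checked on a finite grid. The sketch below assumes the elimination rule implied by the display, namely that $\mathcal{X}_{r}$ keeps the points of $\mathcal{X}_{r-1}$ whose $\mathrm{UCB}_{r-1}$ is at least $\sup_{x^{\prime}}\mathrm{LCB}_{r-1}(x^{\prime})$ . The grid, the test function, the posterior standard deviation and the posterior mean are all made up for illustration, and the mean is constructed so that $|f(x)-\mu_{r-1}(x)|\leq B\sigma_{r-1}(x)$ holds with equality.

```python
import math


def shrink_domain(points, mu, sigma, B):
    """Keep every x whose UCB is at least the largest LCB over the current domain."""
    ucb = {x: mu[x] + B * sigma[x] for x in points}
    lcb = {x: mu[x] - B * sigma[x] for x in points}
    threshold = max(lcb.values())
    return [x for x in points if ucb[x] >= threshold]


# Hypothetical one-dimensional instance on an 11-point grid.
B = 1.0
points = [i / 10 for i in range(11)]
f = {x: math.sin(3 * x) for x in points}
sigma = {x: 0.05 + 0.1 * abs(x - 0.5) for x in points}
# Posterior mean with worst-case error of alternating sign: |f - mu| = B * sigma.
mu = {x: f[x] + (1 if i % 2 == 0 else -1) * B * sigma[x]
      for i, x in enumerate(points)}

survivors = shrink_domain(points, mu, sigma, B)
x_star = max(points, key=f.get)
print(x_star in survivors, len(survivors), len(points))
```

In this toy instance the grid maximizer is retained while most of the grid is eliminated, in line with Lemma B.3.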

B.3.2 Proof of Lemma B.4

We separately show the bounds for $r=1$ and $r\geq 2$ . For the first epoch, we have,

\displaystyle\varsigma_{1}=f(x^{*})-\inf_{x\in\mathcal{X}_{1}}f(x)=f(x^{*})-\inf_{x\in\mathcal{X}}f(x)\leq 2\sup_{x\in\mathcal{X}}|f(x)|\leq 2B.

We used the fact that $\sup_{x\in\mathcal{X}}|f(x)|=\sup_{x\in\mathcal{X}}|f^{\top}\psi_{x}|\leq\sup_{x\in\mathcal{X}}\|f\|_{\mathcal{H}_{k}}\|\psi_{x}\|_{\mathcal{H}_{k}}\leq B$ . Consider any epoch $r\geq 2$ . For the analysis, we define

\displaystyle\mathcal{X}_{r}^{\prime}:=\left\{x\in\mathcal{X}_{r-1}:f(x)+2B\sigma_{r-1}(x)\geq\sup_{x^{\prime}\in\mathcal{X}_{r-1}}\left[f(x^{\prime})-2B\sigma_{r-1}(x^{\prime})\right]\right\}.

The region $\mathcal{X}_{r}^{\prime}$ satisfies $\mathcal{X}_{r}\subseteq\mathcal{X}_{r}^{\prime}$ . To establish this, we once again employ the relation $|f(x)-\mu_{r-1}(x)|\leq B\sigma_{r-1}(x)$ . Using the relation, we can conclude that

	$\displaystyle\mathrm{UCB}_{r-1}(x)$	$\displaystyle=\mu_{r-1}(x)+B\sigma_{r-1}(x)\leq(f(x)+B\sigma_{r-1}(x))+B\sigma% _{r-1}(x)=f(x)+2B\sigma_{r-1}(x)$
	$\displaystyle\mathrm{LCB}_{r-1}(x)$	$\displaystyle=\mu_{r-1}(x)-B\sigma_{r-1}(x)\geq(f(x)-B\sigma_{r-1}(x))-B\sigma% _{r-1}(x)=f(x)-2B\sigma_{r-1}(x).$

The inclusion $\mathcal{X}_{r}\subseteq\mathcal{X}_{r}^{\prime}$ follows immediately from the definition of $\mathcal{X}_{r}$ and $\mathcal{X}_{r}^{\prime}$ and the above expressions.

Consider the following relation which holds for any $x\in\mathcal{X}_{r}^{\prime}$ .

$\displaystyle f(x)+2B\sigma_{r-1}(x)$	$\displaystyle\geq\sup_{x^{\prime}\in\mathcal{X}_{r-1}}[f(x^{\prime})-2B\sigma_{r-1}(x^{\prime})]$
$\displaystyle\implies f(x)$	$\displaystyle\geq\sup_{x^{\prime}\in\mathcal{X}_{r-1}}[f(x^{\prime})-2B\sigma_% {r-1}(x^{\prime})]-\sup_{x^{\prime\prime}\in\mathcal{X}_{r-1}}[2B\sigma_{r-1}(% x^{\prime\prime})]$
	$\displaystyle\geq\sup_{x^{\prime}\in\mathcal{X}_{r-1}}f(x^{\prime})-\sup_{x^{% \prime\prime}\in\mathcal{X}_{r-1}}[4B\sigma_{r-1}(x^{\prime\prime})]$
	$\displaystyle\geq f(x^{*})-\sup_{x^{\prime\prime}\in\mathcal{X}_{r-1}}[4B% \sigma_{r-1}(x^{\prime\prime})].$	(29)

In the last line, we used Lemma B.3 to conclude $\sup_{x^{\prime}\in\mathcal{X}_{r-1}}f(x^{\prime})=f(x^{*})$ . Since $\mathcal{X}_{r}\subset\mathcal{X}_{r}^{\prime}$ , we can use Eqn. (29) to obtain an upper bound on $\varsigma_{r}$ as follows:

	$\displaystyle\varsigma_{r}$	$\displaystyle=f(x^{*})-\inf_{x\in\mathcal{X}_{r}}f(x)$
		$\displaystyle\leq f(x^{*})-\inf_{x\in\mathcal{X}_{r}^{\prime}}f(x)$
		$\displaystyle\leq f(x^{*})-\left[f(x^{*})-\sup_{x^{\prime}\in\mathcal{X}_{r-1}}4B\sigma_{r-1}(x^{\prime})\right]$
		$\displaystyle\leq 4B\sup_{x^{\prime}\in\mathcal{X}_{r-1}}\sigma_{r-1}(x^{% \prime}).$

B.3.3 Proof of Lemma 4.6

We begin with the noiseless case. For brevity, we drop the subscript $0$ from the posterior variance corresponding to the noiseless case. Consider a kernel $k$ and let $\mathcal{H}$ and $\mathcal{H}^{\prime}$ denote the RKHS induced by $k$ on $\mathcal{X}$ and $\mathcal{X}^{\prime}$ . Since $\mathcal{X}^{\prime}\subset\mathcal{X}$ , it is straightforward to note that $\mathcal{H}^{\prime}\subseteq\mathcal{H}$ . Using the result from Wendland (2004, Theorem 10.46), we know that for every $f\in\mathcal{H}^{\prime}$ there exists a natural extension $\mathscr{E}f\in\mathcal{H}$ such that $\|\mathscr{E}f\|_{\mathcal{H}}=\|f\|_{\mathcal{H}^{\prime}}$ . Consequently, we can conclude $\{f:\|f\|_{\mathcal{H}^{\prime}}\leq 1\}\subseteq\{f:\|f\|_{\mathcal{H}}\leq 1\}$ . Lastly, note that $\mathcal{H}^{\prime}$ is the same as the RKHS of the kernel $k^{\prime}(x,y)=k(\Gamma(x),\Gamma(y))$ over the domain $\mathcal{X}$ . Here $\Gamma$ denotes the bi-Lipschitzian map $\Gamma:\mathcal{X}\to\mathcal{X}^{\prime}$ as given by Assumption 4.

Let $X\subset\mathcal{X}$ be any set of distinct points and $\sigma_{X}^{\prime}(x)$ and $\sigma_{X}(x)$ denote the posterior standard deviation at any point $x$ computed using the kernels $k^{\prime}$ and $k$ . Using the dual formulation of posterior variance, we have the following relation:

\displaystyle\sigma_{X}^{\prime}(x)=\sup_{\begin{subarray}{c}f\in\mathcal{H}^{% \prime}\\ \|f\|_{\mathcal{H}^{\prime}}\leq 1\\ f(X)=\{0\}\end{subarray}}f(x)\leq\sup_{\begin{subarray}{c}f\in\mathcal{H}\\ \|f\|_{\mathcal{H}}\leq 1\\ f(X)=\{0\}\end{subarray}}f(x)=\sigma_{X}(x).

In the above relation, we used the fact that $\mathcal{H}^{\prime}\subset\mathcal{H}$ and the unit ball in $\mathcal{H}^{\prime}$ is contained in the unit ball in $\mathcal{H}$ . This implies that the prediction made using the kernel $k^{\prime}$ has a smaller error than the prediction made by using kernel $k$ . For any operator $\Gamma$ and $X=\{x_{1},x_{2},\dots,x_{n}\}$ , we use the shorthand $\Gamma(X)$ for the set $\{\Gamma(x_{1}),\Gamma(x_{2}),\dots,\Gamma(x_{n})\}$ . If we set $X=\Gamma^{-1}(X^{\prime})$ , then the above is equivalent to saying that the prediction error using kernel $k$ corresponding to the set of points $X^{\prime}\subset\mathcal{X}^{\prime}$ is smaller than the prediction error using kernel $k$ corresponding to the set of points $X\subset\mathcal{X}$ .

Since the points $X^{\prime}$ are distributed uniformly in $\mathcal{X}^{\prime}$ , the points $X=\Gamma^{-1}(X^{\prime})$ are distributed according to density $\vartheta(x)=\frac{\det(\nabla\Gamma(x))}{\mathrm{vol}(\mathcal{X}^{\prime})}$ for all $x\in\mathcal{X}$ , where $\det(A)$ denotes the determinant of a matrix $A$ and $\nabla\Gamma$ denotes the Jacobian of $\Gamma$ . Note that $\nabla\Gamma$ (and hence the density $\vartheta$ ) is well-defined almost everywhere (a.e.) as a consequence of Rademacher’s theorem (Rudin, 1987, Chp. 7) and Lipschitz continuity of $\Gamma$ .

Let $\varrho_{\mathrm{unif}}$ denote the uniform distribution on $\mathcal{X}$ (i.e., the Lebesgue measure). We construct a (random) subset of $X$ , denoted by $Y$ , as follows. Each point $x_{i}$ for $i\in\{1,2,\dots,n\}$ is added into $Y$ independently of the others with probability $c_{\vartheta}\frac{\varrho_{\mathrm{unif}}(x_{i})}{\vartheta(x_{i})}$ , where $c_{\vartheta}=\inf_{x}\frac{\vartheta(x)}{\varrho_{\mathrm{unif}}(x)}$ (the infimum is taken over the points where $\vartheta$ is well defined). It is straightforward to note that the samples in $Y$ are distributed according to $\varrho_{\mathrm{unif}}$ . Using the Bernstein inequality for a sum of Bernoulli random variables, we can conclude that $|Y|$ , the number of points in $Y$ , satisfies the relation $|Y|\geq\frac{c_{\vartheta}n}{2C_{\vartheta}}$ with probability $1-\delta$ as long as $\frac{3c_{\vartheta}n}{16C_{\vartheta}}\geq\log(2/\delta)$ . Here $C_{\vartheta}=\sup_{x}\frac{\vartheta(x)}{\varrho_{\mathrm{unif}}(x)}$ . Since $Y\subseteq X$ , the prediction based on the values of $X$ is no worse than the prediction based on the values of $Y$ . Thus,

\displaystyle\sup_{x^{\prime}\in\mathcal{X}^{\prime}}\sigma_{X^{\prime}}^{2}(x^{\prime})\leq\sup_{x\in\mathcal{X}}\sigma_{X}^{2}(x)\leq\sup_{x\in\mathcal{X}}\sigma_{Y}^{2}(x).
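The thinning step used to construct $Y$ can be illustrated in one dimension. The sketch below is a toy instance under stated assumptions: $\mathcal{X}=[0,1]$ , the density $\vartheta(x)=0.5+x$ is made up for the example, $\varrho_{\mathrm{unif}}\equiv 1$ , and hence $c_{\vartheta}=0.5$ . Points drawn from $\vartheta$ are kept independently with probability $c_{\vartheta}\varrho_{\mathrm{unif}}(x)/\vartheta(x)$ , and the retained points should then look uniform on $[0,1]$ .

```python
import math
import random


def sample_theta(rng):
    """Inverse-CDF draw from the toy density theta(x) = 0.5 + x on [0, 1]."""
    u = rng.random()
    # The CDF is F(x) = x/2 + x^2/2; solving F(x) = u gives the root below.
    return (-1.0 + math.sqrt(1.0 + 8.0 * u)) / 2.0


def thin_to_uniform(xs, rng):
    """Keep each x independently with probability c * rho_unif(x) / theta(x).

    Here rho_unif = 1 and c = inf_x theta(x) / rho_unif(x) = 0.5.
    """
    c = 0.5
    return [x for x in xs if rng.random() < c / (0.5 + x)]


rng = random.Random(0)
X = [sample_theta(rng) for _ in range(20000)]
Y = thin_to_uniform(X, rng)

mean_X = sum(X) / len(X)  # the mean under theta is 7/12
mean_Y = sum(Y) / len(Y)  # should be close to 1/2 if Y is uniform
print(round(mean_X, 3), round(mean_Y, 3), round(len(Y) / len(X), 3))
```

In this toy instance the fraction of retained points concentrates around $c_{\vartheta}$ , so $|Y|$ grows linearly in $n$ .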

An identical result holds for the noisy case, by the same series of arguments, using the kernel $k_{\tau}(x,x^{\prime})=k(x,x^{\prime})+\tau\delta_{x=x^{\prime}}$ Kanagawa et al. (2018), where $\delta_{x=x^{\prime}}$ denotes the Kronecker delta (equal to $1$ if $x=x^{\prime}$ and $0$ otherwise). We can invoke the result from Theorem 3.1 for uniform samples on $\mathcal{X}$ to bound $\sigma_{Y}^{2}(x)$ under both the noisy and noiseless settings to obtain the following relations

	$\displaystyle\sup_{x^{\prime}\in\mathcal{X}^{\prime}}\sigma_{X^{\prime},\tau}^% {2}(x^{\prime})\leq\sup_{x\in\mathcal{X}}\sigma_{Y,\tau}^{2}(x)\leq\frac{C_{% \vartheta}}{c_{\vartheta}}\cdot\frac{216}{13}\cdot F^{2}\tau\cdot\frac{\gamma_% {n,\tau}}{n},$
	$\displaystyle\sup_{x^{\prime}\in\mathcal{X}^{\prime}}\sigma_{X^{\prime},0}^{2}% (x^{\prime})\leq\sup_{x\in\mathcal{X}}\sigma_{Y,0}^{2}(x)\leq\frac{C_{% \vartheta}}{c_{\vartheta}}\cdot\frac{216}{13}\cdot F^{2}\cdot n^{1-\beta}.$

To complete the proof, we only need to obtain a bound on the ratio $C_{\vartheta}/c_{\vartheta}$ that is independent of $n$ . Using the Lipschitzness of $\Gamma$ and $\Gamma^{-1}$ , we can conclude that

\displaystyle L_{f}^{\prime-d}\leq|\det(\nabla\Gamma)|\leq L_{f}^{d}.

Using the definition of $c_{\vartheta}$ , we have,

\displaystyle c_{\vartheta}=\inf_{x}\frac{\vartheta(x)}{\varrho_{\mathrm{unif}% }(x)}=\inf_{x}\frac{\det(\nabla\Gamma(x))\mathrm{vol}(\mathcal{X})}{\mathrm{% vol}(\mathcal{X}^{\prime})}\geq\frac{\mathrm{vol}(\mathcal{X})}{L_{f}^{\prime d% }\mathrm{vol}(\mathcal{X}^{\prime})}=\tilde{L_{f}}^{\prime-d}.

Similarly,

\displaystyle C_{\vartheta}=\sup_{x}\frac{\vartheta(x)}{\varrho_{\mathrm{unif}% }(x)}=\sup_{x}\frac{\det(\nabla\Gamma(x))\mathrm{vol}(\mathcal{X})}{\mathrm{% vol}(\mathcal{X}^{\prime})}\leq\frac{L_{f}^{d}\mathrm{vol}(\mathcal{X})}{% \mathrm{vol}(\mathcal{X}^{\prime})}=\tilde{L_{f}}^{d}.

Hence, $C_{\vartheta}/c_{\vartheta}\leq(\tilde{L}_{f}\tilde{L}_{f}^{\prime})^{d}$ depends only on $(\tilde{L}_{f},\tilde{L}_{f}^{\prime})$ and is independent of $n$ , as required.

$\displaystyle\|g^{\top}(\mathbf{Z}^{-1/2}\hat{\mathbf{Z}}\mathbf{Z}^{-1/2}-% \mathbf{Id})g\|$	$\displaystyle=\|(\mathbf{P}g+\mathbf{Q}g)^{\top}(\mathbf{Z}^{-1/2}\hat{\mathbf{% Z}}\mathbf{Z}^{-1/2}-\mathbf{Id})(\mathbf{P}g+\mathbf{Q}g)\|$
	$\displaystyle\leq\|(\mathbf{P}g)^{\top}(\mathbf{Z}^{-1/2}\hat{\mathbf{Z}}% \mathbf{Z}^{-1/2}-\mathbf{Id})(\mathbf{P}g)\|+\|(\mathbf{Q}g)^{\top}(\mathbf{Z}^% {-1/2}\hat{\mathbf{Z}}\mathbf{Z}^{-1/2}-\mathbf{Id})(\mathbf{Q}g)\|$
	$\displaystyle~{}~{}~{}~{}~{}~{}+\|(\mathbf{P}g)^{\top}(\mathbf{Z}^{-1/2}\hat{\mathbf{Z}}\mathbf{Z}^{-1/2}-\mathbf{Id})(\mathbf{Q}g)+(\mathbf{Q}g)^{\top}(\mathbf{Z}^{-1/2}\hat{\mathbf{Z}}\mathbf{Z}^{-1/2}-\mathbf{Id})(\mathbf{P}g)\|$
	$\displaystyle\leq\underbrace{\|g^{\top}\mathbf{P}(\mathbf{Z}^{-1/2}\hat{\mathbf% {Z}}\mathbf{Z}^{-1/2}-\mathbf{Id})\mathbf{P}g\|}_{:=E_{1}}+\underbrace{\|g^{\top% }\mathbf{Q}(\mathbf{Z}^{-1/2}\hat{\mathbf{Z}}\mathbf{Z}^{-1/2}-\mathbf{Id})% \mathbf{Q}g\|}_{:=E_{2}}$
	$\displaystyle~{}~{}~{}~{}~{}~{}+2\underbrace{\|g^{\top}\mathbf{P}(\mathbf{Z}^{-% 1/2}\hat{\mathbf{Z}}\mathbf{Z}^{-1/2}-\mathbf{Id})\mathbf{Q}g\|}_{:=E_{3}}.$	(10)

$\displaystyle E_{1}$	$\displaystyle=\|g^{\top}\mathbf{P}(\mathbf{Z}^{-1/2}\hat{\mathbf{Z}}\mathbf{Z}^% {-1/2}-\mathbf{Id})\mathbf{P}g\|$
	$\displaystyle=\left\|(\mathbf{P}g)^{\top}\mathbf{P}\left(\sum_{i=1}^{n}(\mathbf{L}^{1/2}\omega_{i})(\mathbf{L}^{1/2}\omega_{i})^{\top}-n\mathbf{L}\right)\mathbf{P}(\mathbf{P}g)\right\|$
	$\displaystyle=\left\|(\mathbf{P}g)^{\top}\left(\sum_{i=1}^{n}(\mathbf{P}\mathbf% {L}^{1/2}\mathbf{P}\omega_{i})(\mathbf{P}\mathbf{L}^{1/2}\mathbf{P}\omega_{i})% ^{\top}-n\mathbf{P}\mathbf{L}\mathbf{P}\right)(\mathbf{P}g)\right\|$
	$\displaystyle=n\left\|(\mathbf{P}g)^{\top}\mathbf{P}\mathbf{L}^{1/2}\mathbf{P}% \left(\frac{1}{n}\sum_{i=1}^{n}(\mathbf{P}\omega_{i})(\mathbf{P}\omega_{i})^{% \top}-\mathbf{P}\right)\mathbf{P}\mathbf{L}^{1/2}\mathbf{P}(\mathbf{P}g)\right\|$
	$\displaystyle\leq n\left\|\left(\frac{1}{n}\sum_{i=1}^{n}(\mathbf{P}\omega_{i})(\mathbf{P}\omega_{i})^{\top}-\mathbf{P}\right)\right\|_{2}\cdot\|\mathbf{P}\mathbf{L}^{1/2}\mathbf{P}(\mathbf{P}g)\|_{\mathcal{H}_{k}}^{2}$
	$\displaystyle\leq n\left\|\left(\frac{1}{n}\sum_{i=1}^{n}(\mathbf{P}\omega_{i})(\mathbf{P}\omega_{i})^{\top}-\mathbf{P}\right)\right\|_{2}\cdot(g^{\top}\mathbf{P}\mathbf{L}\mathbf{P}g)$
	$\displaystyle\leq\left\|\left(\frac{1}{n}\sum_{i=1}^{n}(\mathbf{P}\omega_{i})(\mathbf{P}\omega_{i})^{\top}-\mathbf{P}\right)\right\|_{2}\cdot(n\|\mathbf{L}\|_{2})\cdot\|\mathbf{P}g\|_{\mathcal{H}_{k}}^{2}.$	(11)

$\displaystyle E_{3}$	$\displaystyle=\|g^{\top}\mathbf{P}(\mathbf{Z}^{-1/2}\hat{\mathbf{Z}}\mathbf{Z}^% {-1/2}-\mathbf{Id})\mathbf{Q}g\|$
	$\displaystyle=\left\|g^{\top}\mathbf{P}\left(\sum_{i=1}^{n}(\mathbf{L}^{1/2}% \omega_{i})(\mathbf{L}^{1/2}\omega_{i})^{\top}-n\mathbf{L}\right)\mathbf{Q}g\right\|$
	$\displaystyle=\left\|\sum_{i=1}^{n}(g^{\top}\mathbf{P}\mathbf{L}^{1/2}\omega_{i% })(g^{\top}\mathbf{Q}\mathbf{L}^{1/2}\omega_{i})^{\top}\right\|$
	$\displaystyle=\left\|\sum_{i=1}^{n}\underbrace{(g^{\top}\zeta_{i})(g^{\top}\xi_% {i})}_{:=W_{i}}\right\|.$	(19)

$\displaystyle\|W_{i}\|$	$\displaystyle=\|g\|_{\mathcal{H}_{k}}^{2}\left(\frac{g^{\top}\zeta_{i}}{\|g\|_{\mathcal{H}_{k}}}\right)\left(\frac{g^{\top}\xi_{i}}{\|g\|_{\mathcal{H}_{k}}}\right)$
	$\displaystyle\leq\|g\|_{\mathcal{H}_{k}}^{2}(\psi_{x_{i}}^{\top}\zeta_{i})(\psi_{x_{i}}^{\top}\xi_{i})$
	$\displaystyle\leq\|g\|_{\mathcal{H}_{k}}^{2}\|\zeta_{i}\|_{\mathcal{H}_{k}}^{2}\|\xi_{i}\|_{\mathcal{H}_{k}}^{2}$
	$\displaystyle\leq\|g\|_{\mathcal{H}_{k}}^{2}\cdot\left(\sum_{j=1}^{R}\frac{\lambda_{j}}{n\lambda_{j}+\tau}\varphi_{j}^{2}(x_{i})\right)\cdot\left(\sum_{j=R+1}^{\infty}\frac{\lambda_{j}}{n\lambda_{j}+\tau}\varphi_{j}^{2}(x_{i})\right)$
	$\displaystyle\leq\|g\|_{\mathcal{H}_{k}}^{2}\cdot\left(\frac{1}{n}\sum_{j=1}^{R}\varphi_{j}^{2}(x_{i})\right)\cdot\left(\frac{1}{\tau}\sum_{j=R+1}^{\infty}\lambda_{j}\varphi_{j}^{2}(x_{i})\right)$
	$\displaystyle\leq\|g\|_{\mathcal{H}_{k}}^{2}\cdot\frac{N(R)}{n}\cdot\frac{T(R)}{\tau}.$	(21)

$\displaystyle E_{3}$	$\displaystyle=\left\|\sum_{i=1}^{n}(g^{\top}\zeta_{i})(g^{\top}\xi_{i})\right\|$
	$\displaystyle\leq\|g\|_{\mathcal{H}_{k}}\cdot\sqrt{2n\log\left(\frac{6}{\delta}\right)\min\left\{\frac{\|\mathbf{P}g\|_{\mathcal{H}_{k}}^{2}}{n}\cdot\left(\frac{T(R)}{\tau}\right)^{2},\frac{\lambda_{R+1}\|\mathbf{Q}g\|_{\mathcal{H}_{k}}^{2}}{\tau}\cdot\left(\frac{N(R)}{n}\right)^{2}\right\}}$
	$\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+\|g\|_{\mathcal{H}_{k}}^{2}\cdot\frac{2N(R)}{3n}\cdot\frac{T(R)}{\tau}\cdot\log\left(\frac{6}{\delta}\right).$	(23)