Random Exploration in Bayesian Optimization: Order-Optimal Regret and Computational Efficiency
Abstract
We consider Bayesian optimization using Gaussian Process models, also referred to as kernel-based bandit optimization. We study the methodology of exploring the domain using random samples drawn from a distribution. We show that this random exploration approach achieves the optimal error rates. Our analysis is based on novel concentration bounds in an infinite dimensional Hilbert space established in this work, which may be of independent interest. We further develop an algorithm based on random exploration with domain shrinking and establish its order-optimal regret guarantees under both noise-free and noisy settings. In the noise-free setting, our analysis closes the existing gap in regret performance and thereby resolves a COLT open problem. The proposed algorithm also enjoys a computational advantage over prevailing methods due to the random exploration that obviates the expensive optimization of a non-convex acquisition function for choosing the query points at each iteration.
1 Introduction
1.1 GP-based Bayesian Optimization
We consider the problem of sequential optimization of an unknown, possibly non-convex, function $f:\mathcal{X}\to\mathbb{R}$. The learner sequentially chooses a query point $x_t\in\mathcal{X}$ at each time $t$ and observes the function value (potentially subject to noise) at $x_t$. The learning objective is to approach a global maximizer $x^*$ of the function through a sequence of query points $\{x_t\}_{t=1}^{T}$ chosen sequentially in time. In addition to the convergence of $x_t$ to $x^*$, an online measure of the learning efficiency is the cumulative regret
$$R(T) = \sum_{t=1}^{T} \left[ f(x^*) - f(x_t) \right]. \tag{1}$$
The above problem finds a wide range of applications including hyperparameter optimization Li et al. (2016), experimental design Greenhill et al. (2020), recommendation systems Vanchinathan et al. (2014) and robotics Lizotte et al. (2007). An approach that has proven to be particularly effective is Bayesian Optimization (BO) using Gaussian Process (GP) models (a.k.a. kernel-based bandit optimization). The unknown objective function is assumed to live in a Reproducing Kernel Hilbert Space (RKHS) associated with a known kernel. Within the GP-based BO framework, is viewed as a realization of a Gaussian process over . With each new query , the learner sharpens the posterior distribution and uses it as a proxy for for subsequent optimization. We point out that such a Bayesian approach is equally applicable to a frequentist formulation where is deterministic as considered in this work. In this case, the GP model of is fictitious and internal to the algorithm.
Under the assumption of noise-free query feedback, BO techniques were used for optimization as early as 1964 Kushner (1964). GP-based BO was popularized through the work of Močkus et al. (1978). Since then, a number of approaches have been developed and analyzed, often under certain conditions on the kernels and functional characteristics around (see Sec. 1.3 for a detailed discussion). Surprisingly, despite this long history, the design of an algorithm with guaranteed order-optimal regret performance remains an open problem, as discussed in Vakili (2022).
GP-based BO under noisy query was studied much more recently, following the pioneering work by Srinivas et al. (2010) where they proposed the celebrated GP-UCB algorithm. Extensive studies since then have fully characterized the achievable learning performance, both in terms of information-theoretic lower bounds Scarlett et al. (2017) and the design of algorithms such as SupKernel-UCB Valko et al. (2013), GP-ThreDS Salgia et al. (2021), BPE Li and Scarlett (2022), and RIPS Camilleri et al. (2021) that achieve the optimal performance.
Under both the noise-free and noisy settings, a key practical concern for GP-based algorithms is their computational cost. The major computational bottleneck of prevailing GP-based algorithms is the maximization of an acquisition function for choosing the query point at each time instant. The acquisition functions are often non-convex and computationally expensive to maximize. To achieve a low regret order, such an optimization often needs to be carried out with increasing accuracy as time goes on, resulting in a high overall computational requirement.
1.2 Main Results
We explore a new design methodology for GP-based BO: an open-loop exploration of the domain using query points sampled at random from an arbitrary probability distribution supported over the domain. We show that this random exploration approach, while simplistic in nature, leads to order-optimal regret guarantees under both noise-free and noisy feedback models, thus closing the long standing regret gap in the noise-free setting. Moreover, the non-adaptive nature of random sampling bypasses the expensive step of optimizing a non-convex acquisition function, offering a computationally efficient solution without sacrificing learning efficiency.
Random exploration, while not new to many problems (see Sec. 1.3), has not been considered or analyzed for GP-based BO. It stands in sharp contrast to the prevailing exploratory query strategy in GP-based BO: the maximum posterior variance (MPV) sampling. Under MPV, the learning algorithm at each time queries the point with the highest posterior variance conditioned on past observations, i.e., a greedy approach to maximal uncertainty reduction. Surprisingly, we show that the simple, non-adaptive scheme of random exploration achieves the same order of predictive performance as MPV sampling, which is known to be order-optimal. In particular, we show that the worst-case posterior variance corresponding to randomly drawn points is bounded with high probability by and under noisy and noise-free feedback models, where is the maximal information gain from query points and is the order of the polynomial eigendecay of the kernel (see Sec. 2 for their definitions).
A simpler solution is often more demanding when it comes to establishing optimality in performance. The drastically different nature of random exploration from MPV demands different analytical techniques in characterizing its predictive performance. The tightest bound on the worst-case predictive error of MPV sampling, derived in Wenzel et al. (2021), was obtained using the results on scattered data interpolation (i.e., approximating an unknown function using a given set of points) of functions in Sobolev spaces that provide bounds on the worst-case estimation error of the best interpolant based on the fill distance of the given set of points Wendland (2004); Narcowich et al. (2006); Brenner et al. (2008); Arcangéli et al. (2012); Wenzel et al. (2021). Since RKHSs of Matérn kernels are norm-equivalent to Sobolev spaces, these results also immediately translate to estimation errors for function interpolation in RKHSs. The analytical techniques used in these studies require various technical assumptions on the regularity of the function domain and its boundary. These technical assumptions on the function domain present major challenges in incorporating MPV sampling with effective optimization techniques such as domain shrinking/elimination, hindering its potential applicability in designing algorithms with optimal regret. In contrast, in analyzing random exploration, we establish the concentration of the spectrum of the sample covariance operator to that of the true covariance operator that holds universally for all compact domains. The crux of our analysis builds upon a careful treatment of the infinite-dimensional operators to separately ensure the concentration of the initial spectrum (consisting of the larger eigenvalues) and the tail spectrum, which allows us to obtain optimal convergence rate. 
The simplicity of random exploration in its implementation and the generality in its guaranteed predictive performance as established in this work make this exploration strategy an attractive alternative to MPV. We believe that the tools and techniques established here are of independent interest for extending the methodology of random exploration to other problem fields.
Building upon the above key results on random exploration, we develop and analyze a new algorithm for GP-based BO. Referred to as Random Exploration with Domain Shrinking (REDS), this algorithm integrates the exploration strategy of random sampling with the optimization technique of domain shrinking Li and Scarlett (2022); Salgia et al. (2021). Under the noise-free feedback model, we show that REDS incurs a cumulative regret of , which closes the gap to the known lower bound established in Tuo and Wang (2020) and hence resolves the longstanding open problem. The generality of random exploration, both in terms of the design methodology and the performance guarantee, is the reason behind the optimal regret performance of REDS. In particular, the order-optimal predictive performance of random exploration, which holds universally over all compact domains, enables a seamless integration of this exploration strategy with domain shrinking. Similarly, in the noisy setting, we show that REDS incurs a cumulative regret of , which is order-optimal up to logarithmic factors.
The computational advantage of REDS is evident due to the simplicity of random exploration. We further demonstrate this with empirical studies where we compare REDS with BPE Li and Scarlett (2022) and GP-ThreDS Salgia et al. (2021), all offering optimal regret performance. GP-ThreDS was shown to be computationally more efficient than prevailing algorithms such as GP-UCB. We show that REDS offers a significant speed-up in running time over both algorithms without compromising the regret performance. As shown in Table 1, REDS offers a and speed-up in runtime over GP-ThreDS and BPE, respectively.
1.3 Related Work
For GP-based BO with noise-free feedback, a number of algorithms such as GP-EI Močkus (1975), EGO Jones et al. (1998), knowledge-gradient policy Frazier et al. (2008), and GP-PI Kushner (1964); Törn and Žilinskas (1989); Jones (2001) have been proposed, which have since become classical. We refer the reader to the excellent tutorial by Brochu et al. (2010) for a more detailed description of the classical approaches. Despite their good empirical performance and popularity, theoretical guarantee on the convergence of these algorithms has only been established relatively recently. Vazquez and Bect (2010) showed that EI converges almost surely for any function drawn from a GP prior of finite smoothness. Grünewälder et al. (2010) established the convergence rate of a computationally infeasible version of EI. Later, Bull (2011) established convergence rates for the computationally feasible version, showing that GP-EI achieves the optimal simple regret for Matérn kernels with smoothness , which does not translate to optimal cumulative regret performance. More recently, De Freitas et al. (2012) proposed the Branch and Bound algorithm that achieves a constant cumulative regret in Bayesian setting under additional assumptions on the differentiability of the kernel and the behaviour around the unique global maximum, which in practice are difficult to verify. In contrast, REDS requires no such additional assumptions and is analyzed in the frequentist setting. Lyu et al. (2020) showed that for kernels with a polynomial eigendecay with parameter (See Definition 2.2), the GP-UCB algorithm achieves a regret of , which is sub-optimal, as shown in Vakili (2022).
The idea of using random sampling has been explored in related fields. The reconstruction of square integrable functions using random samples is a well-studied problem Bohn and Griebel (2017); Bastian Bohn (2017); Bohn (2018); Smale and Zhou (2004); Cohen et al. (2013); Chkifa et al. (2015); Cohen and Migliorati (2017). In particular, a series of studies considers efficient reconstruction of functions in RKHS using random samples drawn from the domain Kämmerer et al. (2021); Krieg and Ullrich (2021a, b); Moeller and Ullrich (2021). Despite certain similarities in the problem setup, an important point of distinction is that these studies focus on bounding the $L_2$ error of the reconstruction. In this work, we focus on bounding the sup-norm (or equivalently, the $L_\infty$ norm) of the estimation error, which is larger than the $L_2$ norm and more challenging to bound. Since the analysis of algorithms requires a bound on the sup-norm of the estimation error, existing results are not applicable here.
2 Problem Statement
2.1 RKHS and Mercer’s Theorem
Let be a compact subset of and a finite Borel measure supported on . A measure is said to be supported on if for all open sets . For , this is equivalent to being absolutely continuous w.r.t. the Lebesgue measure. Let denote the Hilbert space of (real) functions defined over that are square-integrable w.r.t. . (Footnote 1: To be rigorous, each represents the class of functions that are equivalent -everywhere.)
Consider a positive definite kernel . A Hilbert space of functions on equipped with an inner product is called a Reproducing Kernel Hilbert Space (RKHS) with reproducing kernel if the following conditions are satisfied: (i) , ; (ii) , , . For simplicity, we use to denote . The inner product induces the RKHS norm, . WLOG, we assume that . For brevity, we drop the subscript of from the inner product for the rest of the paper.
Mercer’s Theorem provides an alternative representation for RKHSs through the eigenvalues and eigenfunctions of a kernel integral operator defined over using the kernel .
Theorem 2.1.
(Steinwart and Christmann, 2008, Theorem 4.49) Let be a compact metric space, be a continuous kernel and be a finite Borel measure supported on . Then, there exists an orthonormal system of functions in and a sequence of non-negative values satisfying , such that holds for all and the convergence is absolute and uniform over . Moreover, corresponds to the eigensystem of the kernel integral operator given by for all .
Consequently, the Mercer representation (Steinwart and Christmann, 2008, Thm. 4.51) of the RKHS of is given as
This also implies that with is an orthonormal basis for . The following definition characterizes a class of kernels based on their eigendecay profile corresponding to their Mercer representation.
Definition 2.2.
Let denote the eigenvalues of a kernel arranged in the descending order. The kernel is said to satisfy the polynomial eigendecay condition with a parameter if, for some universal constant , we have for all .
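As a rough numerical illustration of the polynomial eigendecay condition, the sketch below approximates the Mercer eigenvalues of a kernel under the uniform measure by the eigenvalues of the scaled kernel matrix on random samples, then fits a log-log slope. The Matérn-3/2 kernel, the length scale, the sample size, and the index range used for the fit are all illustrative assumptions, not choices made in this paper; for this kernel in one dimension the fitted slope is expected to be roughly $-4$, but the estimate is noisy.

```python
import numpy as np

def matern32_1d(x, lengthscale=0.2):
    """Matern-3/2 kernel matrix on a 1-D sample (illustrative kernel choice)."""
    d = np.abs(x[:, None] - x[None, :])
    s = np.sqrt(3.0) * d / lengthscale
    return (1.0 + s) * np.exp(-s)

rng = np.random.default_rng(0)
n = 500                                   # assumed sample size
x = rng.uniform(0.0, 1.0, size=n)         # uniform measure on [0, 1]

# Eigenvalues of K / n approximate the Mercer eigenvalues of the kernel
# integral operator; eigvalsh returns them in ascending order, so reverse.
eigs = np.linalg.eigvalsh(matern32_1d(x) / n)[::-1]

# Fit log(lambda_j) against log(j) over a middle range of indices.
j = np.arange(5, 41)
slope = np.polyfit(np.log(j), np.log(eigs[j - 1]), 1)[0]
print(f"estimated eigendecay exponent: {-slope:.2f}")
```

The fitted exponent plays the role of the eigendecay parameter in Definition 2.2; only the leading part of the spectrum is used because the smallest empirical eigenvalues are dominated by sampling and rounding error.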
The above class of kernels encompasses a large number of kernels including the widely used Matérn family. We make the following assumption on the kernel which is commonly adopted in the literature Vakili et al. (2021b); Chatterji et al. (2019); Riutort-Mayol et al. (2023).
Assumption 2.3.
The eigenfunctions corresponding to are continuous and hence bounded on , i.e., there exists such that for all .
2.2 Problem Formulation
We consider the problem of optimizing a fixed and unknown function , where is a compact domain and with . A sequential optimization algorithm chooses a point at each time and observes . In the noise-free setting, for all . For the noisy setting, we assume that are independent, zero-mean, -sub Gaussian random variables for some fixed constant , i.e., , for all and . The performance of the sequential algorithm is measured using the notion of cumulative regret, as defined in Eqn. (1).
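The cumulative regret of Eqn. (1) can be tracked directly once the query sequence is known. The following minimal sketch assumes the standard form of cumulative regret, i.e., the running sum of the gaps between the maximum value and the queried values; the toy objective and the query points are hypothetical and serve only to illustrate the computation.

```python
import numpy as np

def cumulative_regret(f_values, f_max):
    """Running cumulative regret: sum over s <= t of (f(x*) - f(x_s))."""
    f_values = np.asarray(f_values, dtype=float)
    return np.cumsum(f_max - f_values)

# Hypothetical toy objective on [0, 1] with maximizer x* = 0.5 and f(x*) = 1.
f = lambda x: 1.0 - (x - 0.5) ** 2
queries = np.array([0.0, 0.25, 0.5, 0.5])
regret = cumulative_regret(f(queries), f_max=1.0)
print(regret)
```

Once the queries reach the maximizer, the per-step gap is zero and the cumulative regret stops growing, which is the behaviour a sub-linear regret bound formalizes.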
2.3 Preliminaries on Gaussian Processes
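Under the GP model, the unknown function $f$ is treated hypothetically as a realization of a Gaussian Process over $\mathcal{X}$ with zero mean and $k$ as the covariance kernel. The noise terms are also viewed as zero-mean Gaussian variables with variance $\tau$. The conjugate property of GPs with Gaussian noise allows for a closed-form expression of the posterior distribution. Specifically, let $\{\mathbf{X}_n, \mathbf{Y}_n\}$ denote a collection of $n$ points $\mathbf{X}_n = (x_1, \dots, x_n)$ and their corresponding observations $\mathbf{Y}_n = (y_1, \dots, y_n)^\top$ obtained according to the model described in Sec. 2.2. Then, conditioned on $\{\mathbf{X}_n, \mathbf{Y}_n\}$, the posterior distribution of $f$ is also a GP with the following mean and covariance functions:
$$\mu_n(x) = k_{\mathbf{X}_n}(x)^\top \left( \mathbf{K}_{\mathbf{X}_n,\mathbf{X}_n} + \tau \mathbf{I}_n \right)^{-1} \mathbf{Y}_n, \tag{2}$$
$$k_n(x, x') = k(x, x') - k_{\mathbf{X}_n}(x)^\top \left( \mathbf{K}_{\mathbf{X}_n,\mathbf{X}_n} + \tau \mathbf{I}_n \right)^{-1} k_{\mathbf{X}_n}(x'), \tag{3}$$
where $k_{\mathbf{X}_n}(x) = [k(x_1, x), \dots, k(x_n, x)]^\top$, $\mathbf{K}_{\mathbf{X}_n,\mathbf{X}_n} = [k(x_i, x_j)]_{i,j=1}^{n}$, and $\mathbf{I}_n$ is the $n \times n$ identity matrix. The posterior variance at a point $x$ is given as $\sigma_n^2(x) = k_n(x, x)$. The expressions for the posterior mean and variance in the noise-free setting are obtained simply by setting $\tau = 0$ in the above relations.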
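The posterior mean and variance described above can be computed with a few lines of linear algebra. The sketch below assumes the standard GP regression formulas (kernel matrix plus noise variance times the identity, inverted via a Cholesky factorization); the Squared Exponential kernel, the length scale, the function names `se_kernel` and `gp_posterior`, and the small positive noise variance used in place of the exact noise-free value are all illustrative assumptions.

```python
import numpy as np

def se_kernel(A, B, lengthscale=0.2):
    """Squared Exponential kernel matrix between the rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def gp_posterior(X, y, X_test, tau=1e-2, kernel=se_kernel):
    """Posterior mean and variance at X_test given observations (X, y).

    tau is the noise variance of the (fictitious) GP model. A small positive
    value is assumed here for numerical stability; it is not the exact
    noise-free setting.
    """
    K = kernel(X, X) + tau * np.eye(len(X))
    k_star = kernel(X, X_test)                      # shape (n, m)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, k_star)
    mean = k_star.T @ alpha
    var = np.diag(kernel(X_test, X_test)) - np.sum(v**2, axis=0)
    return mean, np.maximum(var, 0.0)               # clip rounding error

# Hypothetical usage on a toy 1-D function.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(20, 1))
y = np.sin(6.0 * X[:, 0])
X_test = np.linspace(0.0, 1.0, 50)[:, None]
mean, var = gp_posterior(X, y, X_test)
```

The Cholesky route avoids forming an explicit matrix inverse; the variance is clipped at zero because rounding can make it slightly negative near observed points.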
The posterior mean and variance computed using the GP model above are powerful tools to predict the values of the unknown function and to quantify the uncertainty in the prediction. In particular, the prediction error at a point , , can be upper bounded by , for a certain scaling factor that depends on the feedback model Vakili et al. (2021a).
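Lastly, we define the information gain of a set of points $\mathbf{X}_n = (x_1, \dots, x_n)$ as
$$\mathcal{I}(\mathbf{X}_n) := \frac{1}{2} \log\det\left( \mathbf{I}_n + \tau^{-1} \mathbf{K}_{\mathbf{X}_n,\mathbf{X}_n} \right), \tag{4}$$
where $\mathbf{K}_{\mathbf{X}_n,\mathbf{X}_n} = [k(x_i, x_j)]_{i,j=1}^{n}$ is the kernel matrix and $\tau$ is the noise variance of the GP model. Similarly, we define the maximal information gain as $\gamma_n := \sup_{\mathbf{X}_n \in \mathcal{X}^n} \mathcal{I}(\mathbf{X}_n)$. Maximal information gain is an important term that corresponds to the effective dimension of the kernel and helps characterize the regret of the algorithms. It depends only on the kernel and $\tau$.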
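For a concrete set of points, the information gain can be evaluated numerically. The sketch below assumes the standard log-determinant expression, one half of the log-determinant of the identity plus the kernel matrix divided by the noise variance; the function name `information_gain`, the Squared Exponential kernel, and the parameter values are illustrative assumptions.

```python
import numpy as np

def information_gain(K, tau):
    """0.5 * log det(I_n + K / tau) for a kernel matrix K and noise variance tau."""
    n = K.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(n) + K / tau)
    return 0.5 * logdet

# Hypothetical usage: information gain of nested prefixes of a random sample.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=50)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.2**2)
gains = [information_gain(K[:n, :n], tau=0.1) for n in (10, 25, 50)]
print(gains)
```

Using `slogdet` rather than `det` avoids overflow for larger point sets. The maximal information gain is a supremum over all point sets of a given size, so the value for any particular set, such as the random one here, only lower bounds it.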
3 The Predictive Performance of Random Exploration
The following theorem characterizes the predictive variance, and consequently the predictive error, of a set of randomly sampled points from the domain.
Theorem 3.1.
Let be a compact subset of , be a finite Borel measure supported on , and be a continuous kernel satisfying the polynomial eigendecay condition with parameter (Defn. 2.2). Let denote a collection of i.i.d. points drawn from according to . Let and denote, respectively, the posterior variance conditioned on in the noise-free setting and the noisy setting with a noise variance of . Then, for a given , there exists a constant , such that, with probability at least , for all ,
The above obtained bounds on the worst-case posterior variance under the random exploration scheme are order-optimal (up to polylogarithmic factors), matching the existing lower bounds Scarlett et al. (2017); Tuo and Wang (2020). The above theorem also improves upon the best known results for noisy scattered data approximation. In particular, for the class of Matérn kernels with smoothness (i.e., ), Theorem 3.1 implies a worst-case predictive error of , improving upon the bound of established by Wynne et al. (2021, Corollary 3).
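The behaviour described by Theorem 3.1 can be observed in a small simulation: draw i.i.d. uniform points, compute the posterior variance on a fine grid, and record its maximum as the sample grows. The sketch below is only an empirical illustration under assumed choices (Matérn-3/2 kernel, length scale, noise variance, a 1-D domain, and a finite grid as a stand-in for the supremum over the domain); it does not verify the rates in the theorem.

```python
import numpy as np

def matern32(A, B, lengthscale=0.2):
    """Matern-3/2 kernel matrix between the rows of A and B (assumed kernel)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    s = np.sqrt(3.0) * np.sqrt(np.maximum(sq, 0.0)) / lengthscale
    return (1.0 + s) * np.exp(-s)

def worst_case_variance(X, grid, tau):
    """Largest posterior variance over the grid, conditioned on the points X."""
    K = matern32(X, X) + tau * np.eye(len(X))
    k_star = matern32(X, grid)
    sol = np.linalg.solve(K, k_star)
    var = 1.0 - np.sum(k_star * sol, axis=0)   # k(x, x) = 1 for this kernel
    return float(var.max())

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 400)[:, None]       # proxy for the whole domain
samples = rng.uniform(0.0, 1.0, size=(400, 1))   # i.i.d. uniform query points
sizes = [25, 50, 100, 200, 400]

# Nested prefixes of one sample: conditioning on more points can only
# reduce the posterior variance, so the sequence is non-increasing.
sup_var = [worst_case_variance(samples[:n], grid, tau=1e-2) for n in sizes]
for n, v in zip(sizes, sup_var):
    print(n, v)
```

No acquisition function is optimized at any point: the query locations are fixed by the random draw before any observation is made, which is the source of the computational saving discussed in Sec. 1.2.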
The constant is related to the kernel and measure through two fundamental functions, and , which are given as follows for any :
They are referred to as the spectral functions of the kernel (see Gröchenig (2020) and references therein) because of their dependence on the eigensystem corresponding to the kernel induced by the measure . Both and are fundamental quantities that appear in the analysis of reconstruction and estimation of functions in general spaces. The function corresponds to the inverse of the infimum of the Christoffel function Dunkl and Xu (2014) in the special case of reconstruction using orthogonal polynomials. Under Assumption 2.3 and the condition of polynomial eigendecay (Def. 2.2), can be shown to be bounded as . The dependence of on is mild, as evident from the previous expression. Lastly, is inversely proportional to . Note that Theorem 3.1 ensures that a smaller value of results in a tighter bound on the posterior variance, which in turn requires a larger number of samples. We refer the interested reader to the Appendix A for a more detailed discussion of and its dependence on and . For brevity, we drop the arguments and use the notation in the rest of the paper.
We provide a sketch of the proof of Theorem 3.1 below and refer the reader to Appendix A for a detailed proof.
Proof.
The main idea of the proof is to relate the worst-case posterior variance conditioned on to . This relation is established in two parts. In the first part, we establish that as the number of samples grows, the spectrum of the random operator concentrates to that of , where are defined as follows:
where denotes the random ensemble of points drawn according to the measure . The concentration in spectral norm allows us to approximate the expression of as , i.e., by replacing the sample covariance operator, , with the true covariance operator, . Here, denotes the inverse of an operator , i.e., and denotes the identity operator. Thus, this step allows us to obtain a deterministic bound on posterior variance, which is easier to understand and analyze. We establish the required relation using the following two lemmas:
Lemma 3.2.
For all , the following relation holds with probability :
Lemma 3.3.
If the relation is true for some , then the following is true:
Lemma 3.2 forms the cornerstone of the proof of the theorem. The result is established by bounding the expression for an arbitrary with . We bound the above expression by decomposing it into a sum of three terms. Each of the three terms is then carefully bounded using a combination of the Matrix-Chernoff inequality (Tropp, 2012, Theorem 1.1), a result for spectral norm concentration based on the non-commutative Khintchine inequality Buchholz (2001, 2005); Moeller and Ullrich (2021), and the Bernstein inequality. Lemma 3.3 is established using a combination of the structure of covariance matrices, the Cauchy-Schwarz inequality, and the relation between the operator norm and -norm. We would like to emphasize that both of the above lemmas hold in general for all eigendecay profiles, even without Assumption 2.3.
In the second part, we show that, with high probability, the information gain of the (random) set is lower bounded by , up to a multiplicative constant. The above idea is formalized in the following lemma.
Lemma 3.4.
For all , the following relation holds with probability :
Thus serves as the bridge for connecting the posterior variance to maximal information gain.
The result for the noisy case follows immediately from the above lemmas by noting that . For the noise-free setting, the results do not carry forward immediately as the above analysis does not hold for . To circumvent this issue, we use the fact that is an increasing function of . Thus, we obtain a bound on by using the bound on , where is a carefully chosen value that not only allows us to use the analysis from the noisy case but also ensures that is a close representation of to guarantee tightest possible bounds. ∎
Remark 3.5.
We would like to emphasize that the above result holds for samples generated under every finite Borel measure supported on . However, the quality of the estimate changes with the choice of the measure through the leading constant in the bound in Theorem 3.1.
4 The REDS algorithm
In this section, we present the proposed algorithm and analyze its regret performance.
4.1 REDS with Noise-Free Feedback
REDS integrates random exploration with domain shrinking. It proceeds in epochs, maintaining an active region of the domain during each epoch . The sequence of active regions shrinks across epochs, i.e., , while ensuring for all with high probability. During the epoch, REDS samples points uniformly at random from the set , where and the initial batch size is an input to the algorithm. (Footnote 2: If consists of multiple disjoint regions, then we carry out this step for each region separately.)
Using the observations from these points, REDS computes the posterior mean and variance function over , denoted by and respectively, using the Equations (2) and (3) with . The posterior mean and variance are then used to obtain , an improved localization of , as follows:
Here, and correspond to upper and lower bounds on the estimate of . A pseudocode for the algorithm is provided in Algorithm 1.
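The epoch structure described above can be sketched as follows. This is not Algorithm 1 itself: the sketch assumes a finite candidate set in place of the continuous domain, a doubling batch-size schedule, a fixed confidence multiplier `beta`, a small jitter `tau` in place of the exact noise-free posterior, and the usual elimination rule that keeps every point whose upper bound is at least the largest lower bound. All function and parameter names (`reds_sketch`, `n0`, `beta`, `tau`) are hypothetical.

```python
import numpy as np

def kernel(A, B, lengthscale=0.2):
    """Squared Exponential kernel matrix (assumed kernel choice)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def posterior(X, y, Z, tau):
    """Posterior mean and standard deviation at Z given observations (X, y)."""
    K = kernel(X, X) + tau * np.eye(len(X))
    kz = kernel(X, Z)
    mean = kz.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(kz * np.linalg.solve(K, kz), axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

def reds_sketch(f, candidates, horizon, n0=8, beta=2.0, tau=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    active = candidates.copy()
    queried, n_r, t = [], n0, 0
    while t < horizon and len(active) > 1:
        n = min(n_r, horizon - t)
        # Random exploration: uniform samples from the current active set.
        X = active[rng.integers(len(active), size=n)]
        y = f(X)
        queried.append(X)
        t += n
        # Domain shrinking: drop points whose UCB is below the best LCB.
        mean, std = posterior(X, y, active, tau)
        ucb, lcb = mean + beta * std, mean - beta * std
        active = active[ucb >= lcb.max()]
        n_r *= 2                      # assumed exponentially growing schedule
    return active, np.concatenate(queried)

# Hypothetical toy objective: a kernel section, maximized at x = 0.3.
f = lambda X: np.exp(-0.5 * (X[:, 0] - 0.3) ** 2 / 0.2**2)
candidates = np.linspace(0.0, 1.0, 201)[:, None]
active, queried = reds_sketch(f, candidates, horizon=200)
print(len(active), active.min(), active.max())
```

Each epoch uses only the observations collected in that epoch, and the only per-epoch work is one linear solve and one thresholding pass over the active set.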
4.2 REDS under noisy feedback
The REDS algorithm can be extended to operate under noisy feedback with the following two minor modifications to Algorithm 1. First, the posterior mean and variance in each epoch should be computed using a noise variance (Line of Algorithm 1). Second, the upper and lower confidence bounds, i.e., UCB and LCB (Line of Algorithm 1), should be updated to the following:
(5) | ||||
(6) |
where , and is defined in Assumption 4.1.
4.3 Performance Analysis
For the analysis of the REDS algorithm, we need to make the following two additional assumptions.
Assumption 4.1.
For all , there exists a discretization of such that for all , and , where , is the point in that is closest to . (Footnote 3: The notation is equivalent to for some .)
Assumption 4.2.
Let denote the level set of for . We assume that for all , is a disjoint union of at most components, each of which is closed and connected. Moreover, for each such component, there exists a bi-Lipschitzian map between the component and with normalized Lipschitz constant pair . (Footnote 4: We refer the reader to the supplementary material for additional details about the terms used in this assumption.)
Assumption 4.1 is only required for the noisy case and is a standard assumption adopted in the literature. The existence of such a discretization has been justified and adopted in previous studies Srinivas et al. (2010); Chowdhury and Gopalan (2017); Vakili et al. (2021a); Salgia et al. (2022) and is a mild assumption on the kernel. Specifically, popular kernels such as the Squared Exponential and Matérn kernels are known to be Lipschitz continuous, in which case a -cover of the domain with is sufficient to show the existence of such a discretization. Assumption 4.2 is an assumption on the regularity of the level sets of the function . The existence of a bi-Lipschitzian map between two sets implies topological similarity between the two sets. Intuitively, this assumption ensures that the shape of the level sets is not "too arbitrary". Note that such an assumption on the level sets of is relatively mild, as the RKHS endows the function with smoothness properties, which translate to a degree of topological regularity of level sets Alberti et al. (2011); Lee (2010).
The following theorem characterizes the regret performance of REDS under noise-free feedback.
Theorem 4.3.
Assume that the kernel satisfies the polynomial eigendecay condition with parameter and the function satisfies Assumption 4.2. For a given , if the REDS algorithm is run with and noise-free feedback, then the regret incurred by REDS satisfies
with probability at least . Here, is a constant that depends only on and .
The following is an immediate corollary of the above theorem for the case of Matérn kernels.
Corollary 4.4.
Let be the Matérn kernel with smoothness . For a given , if the REDS algorithm is run with under noise-free feedback on a function satisfying Assumption 4.2, then the regret incurred by REDS satisfies
with probability at least . Here, is a constant that depends only on and .
This matches the result conjectured in Vakili (2022) up to logarithmic factors, resolving the open problem.
The following theorem characterizes the regret performance of REDS in the noisy feedback setting.
Theorem 4.5.
As shown by the above theorem, REDS achieves order-optimal regret (up to logarithmic factors) even under the noisy feedback model.
The proofs of both Theorems 4.3 and 4.5 follow a similar blueprint. A key aspect of both proofs is to ensure that as Theorem 3.1 is invoked across the sets , the leading constant in Theorem 3.1, which has an implicit dependence on the domain through the constant , remains bounded and is independent of . The following lemma shows that for all functions satisfying Assumption 4.2, the leading constant only depends on the function and the initial domain.
Lemma 4.6.
Let be such that Assumption 4.2 holds. Let denote a path-connected component of any level set of and be a set of points drawn uniformly at random from . Then for , the following relations hold with probability :
where and represent, respectively, the constants in Assumption 2.3 and Theorem 3.1 corresponding to the uniform measure on , and are constants that depend only on .
At a high level, the above lemma ensures that under the regularity condition on the topology of level sets (Assumption 4.2), Theorem 3.1 can be applied across level sets of by just paying the penalty of a constant that depends only on . The proof is based on the inclusion of RKHSs over subsets along with a change of measure argument. We refer the reader to Appendix B for a detailed proof of Lemma 4.6 and Theorems 4.3 and 4.5.
5 Empirical Studies
Figure 1: Cumulative regret of BPE, GP-ThreDS and REDS on the benchmark functions (three panels).
We compare the computational efficiency of REDS against algorithms with order-optimal regret performance, namely BPE (Li and Scarlett, 2022) and GP-ThreDS (Salgia et al., 2021) through an empirical study. We compare the regret performance and the running time of the three algorithms for three commonly used benchmark functions in Bayesian Optimization, namely, Branin (Azimi et al., 2012; Picheny et al., 2013), Hartmann-4D (Picheny et al., 2013) and Hartmann-6D (Picheny et al., 2013). The analytical expressions for the three benchmark functions are given as follows:
- Branin function, denoted by , is defined over . where and .
- Hartmann-4D function, denoted by , is defined over .
- Hartmann-6D function, denoted by , is defined over .
In the definitions above, denotes the element of the vector and and refer to the element of the matrices and , defined below:
For BPE and REDS, we consider a discretized version of the domain consisting of , and points chosen uniformly at random from the domain for the Branin, Hartmann-4D and Hartmann-6D functions, respectively. We use the exponentially growing epoch schedule for both BPE and REDS as described in (Algorithm 1) for a fair comparison. We implement GP-ThreDS as described in Salgia et al. (2021). For each node in the tree, we consider a discretization, chosen uniformly at random, of size , and for the Branin, Hartmann-4D and Hartmann-6D functions, respectively. The values of (the lower and upper bound on ) are set to , and for Branin, Hartmann-4D and Hartmann-6D, respectively. We set for all experiments. The value of is set to across all experiments, except for BPE with Hartmann-4D and Hartmann-6D, for which we set it to . These values are obtained using a grid search over in steps of . The parameter in REDS and BPE was set to for Branin and for the Hartmann-4D and Hartmann-6D functions.
Benchmark | BPE | GP-ThreDS | REDS
---|---|---|---
Branin | | |
Hartmann-4D | | |
Hartmann-6D | | |
For all the experiments, we used the Squared Exponential kernel. The length scale was set to for Branin and for the Hartmann-4D and Hartmann-6D functions. We corrupted the observations with zero-mean Gaussian noise with a standard deviation of . All the algorithms were run for time steps. We recorded the cumulative regret and time taken by different algorithms for Monte Carlo runs for each benchmark function.
The regret for the algorithms over different functions is plotted in Figure 1. The shaded region represents the error bars up to standard deviation on either side. The running times, with an error bar of one standard deviation, are tabulated in Table 1. As evident from the plots in Figure 1, the regret incurred by REDS is comparable to that of other algorithms for all benchmark functions. At the same time, REDS offers about a and speedup in terms of runtime over GP-ThreDS and BPE, respectively (see Table 1), demonstrating the practical benefits of our proposed methodology of random sampling.
6 Conclusion
In this work, we studied the methodology of exploring the domain using random samples drawn from a distribution supported on a compact domain. We showed that this non-adaptive approach offers the optimal order of worst-case predictive error for RKHS functions in both noisy and noise-free feedback settings. The proposed approach offers a simple alternative for designing Bayesian Optimization algorithms, which typically involve choosing points through a computationally expensive step of optimizing a non-convex acquisition function. Based on this methodology, we developed an algorithm that achieves order-optimal regret in both noisy and noise-free settings, resolving a COLT open problem. We demonstrated the computational advantage of the proposed approach through an empirical study, where the proposed algorithm achieved up to a runtime speed-up over state-of-the-art algorithms.
References
- Alberti et al. (2011) G. Alberti, S. Bianchini, and G. Crippa. Structure of level sets and sard-type properties of lipschitz maps. Annali della Scuola Normale Superiore di Pisa. Classe di Scienze. Serie V, 4, 08 2011. doi: 10.2422/2036-2145.201107_006.
- Arcangéli et al. (2012) R. Arcangéli, M. C. López de Silanes, and J. J. Torrens. Extension of sampling inequalities to Sobolev semi-norms of fractional order and derivative data. Numerische Mathematik, 121(3):587–608, 2012. ISSN 0029599X. doi: 10.1007/s00211-011-0439-3.
- Azimi et al. (2012) J. Azimi, A. Jalali, and X. Z. Fern. Hybrid batch bayesian optimization. In Proceedings of the 29th International Conference on Machine Learning, ICML, volume 2, pages 1215–1222, 2012. ISBN 9781450312851.
- Bastian Bohn (2017) Bastian Bohn. Error analysis of regularized and unregularized least-squares regression on discretized function spaces. PhD thesis, Rheinische Friedrich-Wilhelms-Universität Bonn, 2017. URL https://hdl.handle.net/20.500.11811/7094.
- Bohn (2018) B. Bohn. On the convergence rate of sparse grid least squares regression. In Sparse Grids and Applications, pages 19–41. Springer International Publishing, 2018. ISBN 978-3-319-75426-0.
- Bohn and Griebel (2017) B. Bohn and M. Griebel. Error estimates for multivariate regression on discretized function spaces. SIAM Journal on Numerical Analysis, 55(4):1843–1866, 2017.
- Brenner et al. (2008) S. C. Brenner and L. R. Scott. The mathematical theory of finite element methods, volume 3. Springer, 2008.
- Brochu et al. (2010) E. Brochu, V. M. Cora, and N. De Freitas. A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning, 2010.
- Buchholz (2001) A. Buchholz. Operator khintchine inequality in non-commutative probability. Mathematische Annalen, 319(1):1–16, 2001.
- Buchholz (2005) A. Buchholz. Optimal constants in khintchine type inequalities for fermions, rademachers and q-gaussian operators. Bulletin of The Polish Academy of Sciences Mathematics, 53:315–321, 2005. URL https://api.semanticscholar.org/CorpusID:55683104.
- Bull (2011) A. D. Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12:2879–2904, 2011. ISSN 15324435.
- Camilleri et al. (2021) R. Camilleri, J. Katz-Samuels, and K. Jamieson. High-Dimensional Experimental Design and Kernel Bandits. In Proceedings of the 38th International Conference on Machine Learning, ICML, 2021. URL https://arxiv.org/abs/2105.05806.
- Chatterji et al. (2019) N. Chatterji, A. Pacchiano, and P. Bartlett. Online learning with kernel losses. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 971–980. PMLR, 2019.
- Chkifa et al. (2015) A. Chkifa, A. Cohen, G. Migliorati, F. Nobile, and R. Tempone. Discrete least squares polynomial approximation with random evaluations- application to parametric and stochastic elliptic pdes. ESAIM: Mathematical Modelling and Numerical Analysis-Modélisation Mathématique et Analyse Numérique, 49(3):815–837, 2015.
- Chowdhury and Gopalan (2017) S. R. Chowdhury and A. Gopalan. On kernelized multi-armed bandits. In Proceedings of the 34th International Conference on Machine Learning, ICML, volume 2, pages 1397–1422, 2017. ISBN 9781510855144.
- Cohen and Migliorati (2017) A. Cohen and G. Migliorati. Optimal weighted least-squares methods. The SMAI Journal of Computational Mathematics, 3:181–203, 2017.
- Cohen et al. (2013) A. Cohen, M. Davenport, and D. Leviatan. On the stability and accuracy of least squares approximations. Foundations of Computational Mathematics, 13:819–834, 2013.
- De Freitas et al. (2012) N. De Freitas, A. J. Smola, and M. Zoghi. Exponential regret bounds for Gaussian process bandits with deterministic observations. In Proceedings of the 29th International Conference on Machine Learning, ICML, volume 2, pages 1743–1750, 2012. ISBN 9781450312851.
- Dunkl and Xu (2014) C. F. Dunkl and Y. Xu. Orthogonal Polynomials of Several Variables. Encyclopedia of Mathematics and its Applications. Cambridge University Press, 2 edition, 2014. doi: 10.1017/CBO9781107786134.
- Frazier et al. (2008) P. I. Frazier, W. B. Powell, and S. Dayanik. A knowledge-gradient policy for sequential information collection. SIAM Journal on Control and Optimization, 47(5):2410–2439, 2008.
- Greenhill et al. (2020) S. Greenhill, S. Rana, S. Gupta, P. Vellanki, and S. Venkatesh. Bayesian optimization for adaptive experimental design: A review. IEEE access, 8:13937–13948, 2020.
- Gröchenig (2020) K. Gröchenig. Sampling, marcinkiewicz–zygmund inequalities, approximation, and quadrature rules. Journal of Approximation Theory, 257:105455, 2020.
- Grünewälder et al. (2010) S. Grünewälder, J.-Y. Audibert, M. Opper, and J. Shawe-Taylor. Regret bounds for gaussian process bandit problems. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 273–280, 2010.
- Jones (2001) D. R. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of global optimization, 21:345–383, 2001.
- Jones et al. (1998) D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13:455–492, 1998.
- Kämmerer et al. (2021) L. Kämmerer, T. Ullrich, and T. Volkmer. Worst-case recovery guarantees for least squares approximation using random samples. Constructive Approximation, 54(2):295–352, 2021.
- Kanagawa et al. (2018) M. Kanagawa, P. Hennig, D. Sejdinovic, and B. K. Sriperumbudur. Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences, 2018.
- Krieg and Ullrich (2021a) D. Krieg and M. Ullrich. Function values are enough for l2-approximation. Foundations of Computational Mathematics, 21:1141–1151, 2021a. doi: https://doi.org/10.1007/s10208-020-09481-w.
- Krieg and Ullrich (2021b) D. Krieg and M. Ullrich. Function values are enough for l2-approximation: Part ii. Journal of Complexity, 66, 2021b. ISSN 0885-064X. doi: https://doi.org/10.1016/j.jco.2021.101569.
- Kushner (1964) H. Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86:97–106, 1964.
- Lee (2010) J. Lee. Introduction to Topological Manifolds. Springer, 2010.
- Li et al. (2016) L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization, 2016.
- Li and Scarlett (2022) Z. Li and J. Scarlett. Gaussian process bandit optimization with few batches. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, AISTATS, 2022.
- Lizotte et al. (2007) D. J. Lizotte, T. Wang, M. H. Bowling, and D. Schuurmans. Automatic gait optimization with gaussian process regression. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), volume 7, pages 944–949, 2007.
- Lyu et al. (2020) Y. Lyu, Y. Yuan, and I. W. Tsang. Efficient batch black-box optimization with deterministic regret bounds, 2020.
- Močkus (1975) J. Močkus. On bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pages 400–404, Berlin, Heidelberg, 1975. Springer Berlin Heidelberg. ISBN 978-3-540-37497-8.
- Moeller and Ullrich (2021) M. Moeller and T. Ullrich. L 2-norm sampling discretization and recovery of functions from rkhs with finite trace. Sampling Theory, Signal Processing, and Data Analysis, 19(2):13, 2021.
- Močkus et al. (1978) J. Močkus, V. Tiesis, and A. Žilinskas. Towards Global Optimization, volume 2, chapter The application of Bayesian methods for seeking the extremum, pages 117–129. Elsevier, 09 1978. ISBN 0-444-85171-2.
- Narcowich et al. (2006) F. J. Narcowich, J. D. Ward, and H. Wendland. Sobolev error estimates and a bernstein inequality for scattered data interpolation via radial basis functions. Constructive Approximation, 24:175–186, 2006.
- Ostrowski (1959) A. M. Ostrowski. A quantitative formulation of Sylvester’s law of inertia. Proceedings of the National Academy of Sciences, 45(5):740–744, 1959. doi: 10.1073/pnas.45.5.740. URL https://www.pnas.org/doi/abs/10.1073/pnas.45.5.740.
- Picheny et al. (2013) V. Picheny, T. Wagner, and D. Ginsbourger. A benchmark of kriging-based infill criteria for noisy optimization. Structural and Multidisciplinary Optimization, 48(3):607–626, 2013. ISSN 1615147X. doi: 10.1007/s00158-013-0919-4. URL https://link.springer.com/article/10.1007/s00158-013-0919-4.
- Riutort-Mayol et al. (2023) G. Riutort-Mayol, P.-C. Bürkner, M. R. Andersen, A. Solin, and A. Vehtari. Practical hilbert space approximate bayesian gaussian processes for probabilistic programming. Statistics and Computing, 33(1):17, 2023.
- Rudin (1987) W. Rudin. Real and complex analysis, 3rd ed. McGraw-Hill, Inc., USA, 1987. ISBN 0070542341.
- Salgia et al. (2021) S. Salgia, S. Vakili, and Q. Zhao. A domain-shrinking based Bayesian optimization algorithm with order-optimal regret performance. In Proceedings of the 35th Annual Conference on Neural Information Processing Systems, volume 34, 2021.
- Salgia et al. (2022) S. Salgia, S. Vakili, and Q. Zhao. Collaborative Learning in Kernel-based Bandits for Distributed Users, 2022.
- Scarlett et al. (2017) J. Scarlett, I. Bogunovic, and V. Cevher. Lower Bounds on Regret for Noisy Gaussian Process Bandit Optimization. In Conference on Learning Theory, volume 65, pages 1–20, 2017.
- Smale and Zhou (2004) S. Smale and D.-X. Zhou. Shannon sampling and function reconstruction from point values. Bulletin of The American Mathematical Society, 41:279–306, 2004. doi: 10.1090/S0273-0979-04-01025-0.
- Srinivas et al. (2010) N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, ICML, pages 1015–1022, 2010. ISBN 9781605589077. doi: 10.1109/TIT.2011.2182033.
- Steinwart and Christmann (2008) I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008. doi: https://doi.org/10.1007/978-0-387-77242-4.
- Törn and Žilinskas (1989) A. Törn and A. Žilinskas. Global Optimization. Springer Berlin, Heidelberg, 1989.
- Tropp (2012) J. A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12:389–434, 2012.
- Tuo and Wang (2020) R. Tuo and W. Wang. Kriging prediction with isotropic matérn correlations: Robustness and experimental designs. The Journal of Machine Learning Research, 21(1):7604–7641, 2020.
- Vakili (2022) S. Vakili. Open problem: Regret bounds for noise-free kernel-based bandits. In Proceedings of 35th Conference on Learning Theory (COLT), volume 178, pages 5624–5629, 2022.
- Vakili et al. (2021a) S. Vakili, N. Bouziani, S. Jalali, A. Bernacchia, and D.-s. Shiu. Optimal order simple regret for Gaussian process bandits. In Proceedings of the 35th Annual Conference on Neural Information Processing Systems, 2021a.
- Vakili et al. (2021b) S. Vakili, K. Khezeli, and V. Picheny. On information gain and regret bounds in Gaussian process bandits. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, AISTATS, 2021b.
- Valko et al. (2013) M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini. Finite-time analysis of kernelised contextual bandits. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence, UAI, pages 654–663, 2013.
- Vanchinathan et al. (2014) H. P. Vanchinathan, I. Nikolic, F. De Bona, and A. Krause. Explore-exploit in top-n recommender systems via gaussian processes. In Proceedings of the 8th ACM Conference on Recommender Systems, pages 225–232, 2014.
- Vazquez and Bect (2010) E. Vazquez and J. Bect. Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. Journal of Statistical Planning and Inference, 140(11):3088–3095, 2010. ISSN 0378-3758. doi: https://doi.org/10.1016/j.jspi.2010.04.018.
- Wasserman (2008) L. Wasserman. Lecture notes on statistical methods for machine learning, 2008. URL https://www.stat.cmu.edu/~larry/=sml/Concentration.pdf.
- Wendland (2004) H. Wendland. Scattered Data Approximation. Cambridge University Press, 2004. doi: 10.1017/CBO9780511617539.
- Wenzel et al. (2021) T. Wenzel, G. Santin, and B. Haasdonk. A novel class of stabilized greedy kernel approximation algorithms: Convergence, stability and uniform point distribution. Journal of Approximation Theory, 262, 2021. ISSN 10960430. doi: 10.1016/j.jat.2020.105508.
- Wynne et al. (2021) G. Wynne, F.-X. Briol, and M. Girolami. Convergence guarantees for gaussian process means with misspecified likelihoods and smoothness. The Journal of Machine Learning Research, 22(1):5468–5507, 2021.
Appendix A Proof of Theorem 3.1
We begin by setting up some notation that will be used throughout the proof. Throughout the appendix, we will represent the elements of as infinite dimensional vectors and operators over these function spaces as infinite dimensional matrices. We adopt such a convention for ease of presentation while keeping in mind that despite the matrix representation, the actual operation is over elements of . Recall that we defined the sample covariance operator for a randomly chosen sample and its expected value as follows for any :
In the matrix-vector notation, the operators (equivalently, matrices) are given as:
where is the identity matrix (operator) and is the diagonal matrix consisting of the eigenvalues of the kernel corresponding to the measure . If we define , then we can also write . Consequently, the posterior variance at any point is given as:
For any , we define the following two quantities that will be relevant during our analysis:
(7) | ||||
(8) |
Recall that are eigenfunctions of the kernel operator and form an orthonormal system in and are an orthonormal basis for . The term is often referred to as the spectral function (see Gröchenig (2020) and references therein) and in case of orthogonal polynomials, it is the inverse of the infimum of the Christoffel function Dunkl and Xu (2014). Both and are fundamental quantities that appear in the analysis of reconstruction and estimation of functions.
Lastly, based on and , for a given kernel , measure and , we define the following terms for any and :
The dependence on and is implicit through and used to define and . For brevity of notation, going forward, we drop the explicit description of dependence on and .
We are now ready to prove the theorem. We first prove the statement of the theorem, assuming that the lemmas hold, followed by the proofs of the lemmas.
We begin with the result for the noisy case, where is fixed (independent of ). From Lemma 3.2, we know that for , holds with probability . Using this result along with Lemma 3.3, we can conclude that holds for all . Thus, we have,
(9) |
as required. The third line in the above expression follows from Lemma 3.4. We would like to emphasize that the polynomial eigendecay condition is not necessary to obtain the above relation. It is only necessary to bound the information gain in terms of . Under the polynomial eigendecay condition with parameter , the above equation can also be written as
where we used the bound on information gain from Vakili et al. (2021b, Corollary 1) and is an appropriately chosen constant independent of and .
We now consider the noise-free case. Since information gain is only defined for , we cannot directly extend the analysis as used in the noisy case by substituting . To circumvent this issue, we carefully choose , such that is a close representation of . We choose to be dependent on such that goes to as becomes larger. This allows to faithfully represent the value of over the range of . Specifically, we choose for , where is the constant in Assumption 2.3. The condition on constant ensures that exists. Since all conditions of the analysis for (noisy case) are satisfied, we can directly invoke the result for . Using the bound on and the monotonicity of as a function of , we obtain,
where is a constant independent of .
A.1 Proof of Lemma 3.2
Since we are interested in bounding the -norm of the operator , we will focus on finding an upper bound on that holds uniformly for all functions in the unit ball in RKHS, i.e., . The high level idea is to separately consider the contribution of the component of that belongs to the subspace spanned by eigenfunctions corresponding to the “large” eigenvalues, i.e., the head of the spectrum, and that of the component corresponding to the “small” eigenvalues, i.e., the tail of the spectrum.
Throughout the proof, we fix a . The existence of such an is guaranteed by the assumption . For the analysis, we define two projection operators, and . We define as the projection operator onto the subspace spanned by , i.e., for any , . Note that is an orthogonal projection operator. Similarly, we define .
We also introduce some additional notation for the ease of presentation. We define to be the diagonal matrix (operator) whose entry is . Similarly, let for . Using this notation, we can rewrite the matrix as
For any , we have the following decomposition:
(10) |
We separately bound the terms and , beginning with . We have,
(11) |
In the above equations, we used the fact that for any diagonal matrix , and that . Firstly, note that . Consequently, . Secondly, to bound the first term on the RHS, we denote for all . We have, . Moreover, for all ’s, only the top sub-matrix has non-zero entries, implying it is sufficient to bound the -norm of that finite sub-matrix to bound the first term on the RHS. We use Matrix-Chernoff inequality (Tropp, 2012, Theorem 1.1) to bound the -norm of this finite dimensional submatrix.
For all , let denote the -dimensional vector corresponding to the first coordinates of . Thus, we are interested in applying the Matrix-Chernoff inequality to bound the following expression:
where denotes the dimensional identity matrix. Here, we used the fact that the relevant sub-matrix of , or equivalently , corresponds to . To invoke the Matrix-Chernoff inequality, we need bounds on the maximum and minimum eigenvalue of and a bound on that holds almost surely for all . Since , both the maximum and minimum eigenvalues are . For any , we have,
On invoking the Matrix-Chernoff inequality with these results, we obtain that the following relation is true with probability :
(12) |
On combining the above bound with Eqn. (11) along with noting that , we can conclude that:
(13) |
We would like to mention that the above bound is only valid when the RHS in Eqn. (12) is less than . However, this condition is satisfied by the choice of .
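The concentration of the head block can be checked numerically in a toy setting. The sketch below is only an illustration under stated assumptions: it uses the cosine system on the unit interval with the uniform measure as a stand-in for the eigenfunctions (the names `cosine_features` and `gram_deviation` are hypothetical), forms the empirical second-moment matrix of the first `m` features, and reports its spectral-norm deviation from the identity.

```python
import numpy as np

def cosine_features(x, m):
    """First m functions of the cosine system, orthonormal in L2([0, 1])."""
    j = np.arange(m)
    feats = np.sqrt(2.0) * np.cos(np.pi * j[None, :] * x[:, None])
    feats[:, 0] = 1.0  # the constant function has unit norm without rescaling
    return feats

def gram_deviation(n, m, rng):
    """Spectral norm of (1/n) * sum_i phi(x_i) phi(x_i)^T - I_m."""
    phi = cosine_features(rng.uniform(0.0, 1.0, n), m)
    gram = phi.T @ phi / n
    return np.linalg.norm(gram - np.eye(m), 2)

rng = np.random.default_rng(0)
for n in (200, 2000, 20000):
    print(n, gram_deviation(n, 5, rng))
```

In this toy example the squared norm of the feature vector is at most 2m - 1 for every point, which plays the role of the almost-sure bound required by the Matrix-Chernoff inequality; the deviation shrinks as the number of random samples grows relative to that bound.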
We now consider the second term, . We have,
(14) | ||||
(15) |
Note that the term has a similar structure as except for the fact that involves infinite-dimensional vectors as opposed to finite-dimensional vectors. Thus, to bound we use a result from Moeller and Ullrich (2021, Proposition 3.8), which is a spectral concentration inequality for infinite-dimensional vectors derived using the non-commutative Khintchine inequality Buchholz (2001, 2005); Moeller and Ullrich (2021). From Proposition 3.8 in Moeller and Ullrich (2021), we can conclude that the following relation holds with probability at least :
(16) |
where and . We can further bound the terms and as follows.
On plugging this into Eqn. (16), we obtain the following bound on .
(17) |
Combining Eqns. (15) and (17) yields
(18) |
We now move on to the third term, , which contains the cross terms. For brevity of notation, we define and for all . Note that for all . Since and commute with , a diagonal matrix, it is straightforward to note that . Using this relation along with the definition of and , we can rewrite as follows:
(19) |
We use the Bernstein inequality to bound the sum of the random variables , for which we need the values of , and an upper bound on that holds almost surely. We begin with . We have,
(20) |
For an upper bound on , note that for any with , is maximized for the choice of . Thus,
(21) |
From the above expressions, we can also conclude that and . We use these relations to obtain a bound on . We have,
(22) |
In the last step, we used the bounds on and derived in the earlier part of the proof. Lastly, since , . On applying Bernstein inequality (Wasserman, 2008, Lemma 7.37) using the relations from Eqns. (20), (21) and (22), we can conclude that the following relation holds with probability :
(23) |
On plugging in any value of and using the definition of along with the relation , we can conclude that with probability at least . The overall probability on the bound is obtained using a union bound for the relations on , and .
A.2 Proof of Lemma 3.3
We begin the proof by showing that we can relate to through the operator norm of . Specifically, we show that if the operator norm of is small, then and are within a constant factor of each other. Lastly, we use the condition on to bound , the operator norm of , to obtain the required result.
We begin with considering the following expression.
(24) |
Consider the scenario where the relation is satisfied for some . We claim that under this scenario, we have, . To show this claim, we consider Eqn. (24). If , the claim follows immediately. For the other case, we have,
as claimed. Thus, it suffices to show that is small.
To that effect, note that we can write the operator as where, . Consequently, using the definition of operator norm yields
(25) |
From the definition of , we have , from the given statement in the Lemma. Note that if for some , then all eigenvalues of lie in the interval . This implies that all the eigenvalues of lie in the interval . Hence, . On combining this with Eqn. (25), we can conclude that if , then . On combining this with the previous claim that relates to through , we arrive at the result.
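The eigenvalue argument above can be sanity-checked on a finite matrix. The sketch below assumes the condition has the form "the spectral norm of M - I is at most delta" for some delta < 1, with `M` a hypothetical symmetric stand-in for the operator in the lemma; under that assumption the eigenvalues of the inverse lie in [1/(1+delta), 1/(1-delta)] and the inverse deviates from the identity by at most delta/(1-delta).

```python
import numpy as np

def inverse_deviation(delta, dim=6, seed=0):
    """Build M = I + E with ||E||_2 = delta and inspect the spectrum of M^{-1}."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((dim, dim))
    sym = (a + a.T) / 2.0
    sym *= delta / np.linalg.norm(sym, 2)   # rescale so the spectral norm is delta
    m_inv = np.linalg.inv(np.eye(dim) + sym)
    eig_inv = np.linalg.eigvalsh(m_inv)     # eigenvalues of the inverse
    dev = np.linalg.norm(m_inv - np.eye(dim), 2)
    return eig_inv, dev

eig_inv, dev = inverse_deviation(0.3)
print(eig_inv.min(), eig_inv.max(), dev)
```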
A.3 Proof of Lemma 3.4
Similar to the analysis in Appendix A.1, we fix an and define projection matrices and using the value of as defined in Appendix A.1. We define the projection of the kernel operator on the subspaces spanned by and as follows:
Recall that denotes the information gain corresponding to the randomly drawn set of points . Similar to , we also define and as and . It is straightforward to note that .
We first derive some auxiliary results on the spectrum of and which will be useful in the analysis later. Recall that we defined . We can also rewrite , and in terms of as: , and . Using this relation, note that the singular values of and are the same as that of and respectively.
For the spectrum of , note that
If denote the eigenvalues of , then using Ostrowski’s Theorem Ostrowski (1959), we can conclude that for all , where correspond to the eigenvalues of and lie between the smallest and largest eigenvalues of the matrix . Note that the singular values (in this case, also eigenvalues) of are the same as that of , where , as defined in Appendix A.1. Using Eqn. (12) and that and , we can conclude that the following relation is true with probability :
Thus, we can conclude that eigenvalues of lie in the range and consequently, .
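Ostrowski's theorem, as used above, can be illustrated on small matrices. The following sketch is a generic numerical check and not the specific operators of the proof: for a symmetric positive definite matrix `a` and a nonsingular matrix `s` (both hypothetical), it compares the sorted eigenvalues of the congruence `s a s^T` with those of `a` and verifies that each ratio lies between the smallest and largest eigenvalues of `s s^T`.

```python
import numpy as np

def ostrowski_ratios(dim=6, seed=0):
    """Return the ratios lambda_k(S A S^T) / lambda_k(A) and the spectrum range of S S^T."""
    rng = np.random.default_rng(seed)
    b = rng.standard_normal((dim, dim))
    a = b @ b.T + 0.1 * np.eye(dim)        # symmetric positive definite
    s = rng.standard_normal((dim, dim))    # nonsingular with probability one
    congr = s @ a @ s.T
    congr = (congr + congr.T) / 2.0        # remove rounding asymmetry
    # eigvalsh returns eigenvalues in ascending order for both matrices
    theta = np.linalg.eigvalsh(congr) / np.linalg.eigvalsh(a)
    spectrum = np.linalg.eigvalsh(s @ s.T)
    return theta, spectrum[0], spectrum[-1]

theta, low, high = ostrowski_ratios()
print(low, theta, high)
```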
As mentioned earlier, the singular values of are the same as those of . For the analysis, it suffices to have an upper bound on , or equivalently, . Using the result from Moeller and Ullrich (2021, Proposition 3.8), we know that the following relation holds with probability :
Since , we can conclude that . We are now ready to prove the lemma.
Using the relation , we can decompose the information gain of as follows:
This decomposition is similar to that derived in Vakili et al. (2021b, App. A, Eqn. 8) with the roles of and interchanged.
We begin with . Since , all eigenvalues of are less than . Using the relation , which holds for all , we can lower bound as follows:
Note that are i.i.d. random variables with and . We can thus use Hoeffding's inequality to obtain the following bound on which holds with probability at least :
In the third line, we used the fact that since for all (Assumption 2.3). The fourth line uses the condition that .
To bound , first note that using the condition on the spectrum on , we can conclude that all the eigenvalues of lie in the range . Moreover, note that the spectrum of is the same as that of . On using Ostrowski’s Theorem Ostrowski (1959) along with range of eigenvalues of , we can conclude that
Using the relation for the eigenvalues of derived earlier, we can further bound this as follows:
In the fourth line, we used the relation , which holds for all .
On combining the bounds for and , we obtain,
as required. Since each of the bounds on and the eigenvalues of and , holds with probability at least , the overall bound holds with probability at least .
Appendix B Proof of Theorems 4.3 and 4.5
The proofs of both theorems follow along the lines of the analysis of the Batched Pure Exploration (BPE) algorithm Li and Scarlett (2022). We begin with a brief discussion of Assumption 4 and then move on to the proofs.
Definition B.1.
Let be a map between two sets . We call a bi-Lipschitz map if the inverse map, , exists and the following relations hold for some :
We refer to as the Lipschitz constant pair of . We also define the normalized Lipschitz constant pair of to be the pair .
The normalized Lipschitz constant pair quantifies solely the change due to structure and discounts for the change in size between and . The following is a restatement of Assumption 4.
Assumption B.2.
Let denote the level set of for . Then,
- For all , is a disjoint union of at most closed, path connected components.
- For a given , let denote the such connected component of . We assume that there exists a bi-Lipschitz map with normalized Lipschitz constant pair for all . Let and . We assume that .
Assumption 4 is an assumption on the regularity of the level sets of the function . The term can be thought of as the number of local maxima of and hence finiteness of is a mild assumption on satisfied by functions encountered in practice. Moreover, the knowledge of is only required for analysis and not for the algorithm to run. The second condition on is to ensure that these connected components are topologically regular enough and to avoid certain pathological cases. In particular, the existence of a bi-Lipschitz map between two sets implies topological similarity between the two sets. Intuitively, this assumption ensures that the shape of the level sets is not “too arbitrary”. Note that such an assumption on the level sets of is relatively mild as the RKHS endows smoothness properties to the function which translate to a degree of topological regularity of level sets Alberti et al. (2011); Lee (2010).
B.1 Proof of Theorem 4.3
At a high level, the bound on regret is obtained by first separately bounding the regret during every epoch and then summing it across all epochs. During any epoch , since REDS chooses points uniformly at random from the current domain , we simply bound the regret incurred at each point queried during this epoch by the worst case scenario, i.e., . This leads to an upper bound of on the regret incurred during epoch , as there are at most connected components in each level set. Since poorly performing regions of the domain are eliminated as the algorithm proceeds, gets closer to , reducing the regret in each epoch as the algorithm proceeds.
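The epoch structure described above can be sketched on a discretized one-dimensional domain. This is a minimal illustration and not the algorithm as specified in the paper: the elimination rule (keep points whose upper confidence bound is at least the best lower confidence bound), the confidence multiplier `beta`, the regularizer `lam`, the kernel length scale, and the test function are all assumptions made for the example.

```python
import numpy as np

def se_kernel(a, b, ell=0.2):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell**2))

def posterior(x_obs, y_obs, x_test, lam):
    """Posterior mean and standard deviation of a GP with unit prior variance."""
    gram = se_kernel(x_obs, x_obs) + lam * np.eye(len(x_obs))
    cross = se_kernel(x_obs, x_test)
    sol = np.linalg.solve(gram, cross)
    mean = sol.T @ y_obs
    var = 1.0 - np.sum(cross * sol, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

def reds_sketch(f, candidates, horizon, first_epoch=4, beta=2.0,
                lam=0.01, noise_std=0.1, seed=0):
    rng = np.random.default_rng(seed)
    domain = np.asarray(candidates, dtype=float)
    sizes, used, epoch_len = [len(domain)], 0, first_epoch
    while used < horizon:
        n_r = min(epoch_len, horizon - used)
        x = rng.choice(domain, size=n_r)          # uniform random exploration
        y = f(x) + noise_std * rng.standard_normal(n_r)
        mean, std = posterior(x, y, domain, lam)  # uses this epoch's samples only
        keep = mean + beta * std >= np.max(mean - beta * std)
        domain = domain[keep]                     # shrink the domain
        sizes.append(len(domain))
        used += n_r
        epoch_len *= 2                            # epoch lengths double
    return domain, sizes, used

f = lambda x: np.exp(-(x - 0.7) ** 2 / 0.02)
domain, sizes, used = reds_sketch(f, np.linspace(0.0, 1.0, 400), horizon=124)
print(sizes, domain.min(), domain.max())
```

No acquisition function is optimized at any step: each epoch costs one batch of random draws and one posterior computation over the surviving candidates, which is the source of the computational advantage discussed in the paper.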
The following two lemmas ensure the correctness of the algorithm and help bound the regret incurred during each epoch.
Lemma B.3.
for all .
Lemma B.4.
For all epochs , we have,
We defer the proof of these lemmas to Appendix B.3. Equipped with these lemmas, we move on to the proof of Theorem 4.3. The regret incurred by REDS can be bounded as
In the above expression, denotes the total number of epochs that begin during a run of REDS algorithm before reaching a total of queries. Since the epoch lengths double every epoch, we have . We can further bound using Lemma 4.6 (which in turn is based on Theorem 3.1) to bound the worst-case posterior standard deviation in the above equation. Since is compact ( is closed by definition and is bounded because ) and , we can invoke Lemma 4.6 to conclude
(26) |
where , and are the constants from Lemma 4.6 and depend only on . For simplicity, we define , as a constant that depends only on the function . On plugging in the values of , Eqn. (26) simplifies to
(27) |
We consider three separate cases based on the value of :
On combining all the cases, we arrive at the result. The statement in Corollary 4.4 follows immediately from the above proof by plugging in .
B.2 Proof of Theorem 4.5
The proof of Theorem 4.5 is almost identical to that of Theorem 4.3. The following lemma is a counterpart to Lemma B.4 for the noisy case.
Lemma B.5.
For all epochs , the following relation holds with probability at least :
The proof of this lemma is identical to that of Lemma B.4 with the definitions of and changed according to the noisy setup (see Vakili et al. (2021a) for an exact derivation). On using Lemma 4.6 (for the noisy case) along with Lemma B.5, we can rewrite Eqn. (26) as
(28) |
where the second line follows using monotonicity of i.e., for all and is the leading constant in Eqn. (9). On plugging in the values of in Eqn. (28), we obtain,
where as before. Hence, satisfies , as required.
B.3 Proof of Auxiliary Lemmas
B.3.1 Proof of Lemma B.3
The main ingredient in the proof is the relation: , which holds for all and across all epochs . This is a well-known relation in the literature Vakili et al. (2021a); Lyu et al. (2020) that bounds the predictive performance of the posterior mean in terms of posterior variance.
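A small numerical check can make this relation concrete. The sketch below assumes it takes the standard noise-free form |f(x) - mu(x)| <= ||f|| * sigma(x), where ||f|| is the RKHS norm; the test function, kernel length scale, and regularizer `lam` are hypothetical choices. The function is built as a finite kernel expansion so that its RKHS norm is available in closed form.

```python
import numpy as np

def se_kernel(a, b, ell=0.3):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell**2))

rng = np.random.default_rng(1)

# f = sum_i alpha_i k(., z_i) lies in the RKHS with ||f||^2 = alpha^T K_zz alpha.
centers = rng.uniform(0.0, 1.0, 8)
alpha = rng.standard_normal(8)
rkhs_norm = np.sqrt(alpha @ se_kernel(centers, centers) @ alpha)

def f(x):
    return se_kernel(x, centers) @ alpha

x_obs = rng.uniform(0.0, 1.0, 12)          # randomly drawn query points
x_test = np.linspace(0.0, 1.0, 201)
lam = 1e-4                                 # small regularizer for stability

gram = se_kernel(x_obs, x_obs) + lam * np.eye(len(x_obs))
cross = se_kernel(x_obs, x_test)
sol = np.linalg.solve(gram, cross)
mean = sol.T @ f(x_obs)                    # posterior mean from noise-free values
std = np.sqrt(np.maximum(1.0 - np.sum(cross * sol, axis=0), 0.0))

err = np.abs(f(x_test) - mean)
print(np.max(err - rkhs_norm * std))       # non-positive up to rounding
```

The regularized posterior standard deviation is slightly larger than the interpolation one, so the inequality continues to hold with `lam > 0` while the linear solve stays well conditioned.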
We use induction to prove the lemma. Since and holds by definition, . Assume that . Using the relation , we can conclude,
where we used the inductive hypothesis to establish . This implies that , as required.
B.3.2 Proof of Lemma B.4
We separately show the bounds for and . For the first epoch, we have,
We used the fact that . Consider any epoch . For the analysis, we define
The region satisfies . To establish this, we once again employ the relation . Using the relation, we can conclude that
The inclusion follows immediately from the definition of and and the above expressions.
B.3.3 Proof of Lemma 4.6
We begin with the noiseless case. For brevity, we drop the subscript from the posterior variance corresponding to the noiseless case. Consider a kernel and let and denote the RKHS induced by on and . Since , it is straightforward to note that . Using the result from Wendland (2004, Theorem 10.46), we know that for every there exists a natural extension such that . Consequently, we can conclude . Lastly, note that is the same as the RKHS of the kernel over the domain . Here denotes the bi-Lipschitz map as given by Assumption 4.
Let be any set of distinct points and and denote the posterior standard deviation at any point computed using the kernels and . Using the dual formulation of posterior variance, we have the following relation:
In the above relation, we used the fact that and the unit ball in is contained in the unit ball in . This implies that the prediction made using the kernel has a smaller error than the prediction made by using kernel . If we set (where, for any operator and , we use the shorthand for the set ), then the above is equivalent to saying that the prediction error using kernel corresponding to set of points is smaller than the prediction error using kernel corresponding to set of points .
Since the points are distributed uniformly in , the points are distributed according to density for all , where denotes the determinant of a matrix and denotes the Jacobian of . Note that (and hence the density ) is well-defined almost everywhere (a.e.) as a consequence of Rademacher’s theorem (Rudin, 1987, Chp. 7) and Lipschitz continuity of .
Let denote the uniform distribution on (i.e., the Lebesgue measure). We construct a (random) subset of , denoted by , as follows. Each point for is added into independently of others with probability , where (where the infimum is taken over where is well defined). It is straightforward to note that the samples in are distributed according to . Using the Bernstein inequality for sum of Bernoulli random variables, we can conclude that , the number of points in satisfies the relation with probability as long as . Here . Since , the prediction based on the values of is no worse than the prediction based on the values of . Thus,
An identical result holds for the noisy case using an identical series of arguments using the kernel Kanagawa et al. (2018), where denotes the Dirac delta function. We can invoke the result from Theorem 3.1 for uniform samples on to bound under both the noisy and noiseless settings to obtain the following relations
We only need to obtain a bound on the ratio that is independent of to complete the proof. Using the Lipschitzness of and , we can conclude that
Using the definition of , we have,
Similarly,
Hence, depends only on and is independent of , as required.
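The thinning construction used in this proof can be illustrated in one dimension. The sketch below is a toy example under stated assumptions: points are drawn from the hypothetical density p(x) = 0.5 + x on [0, 1], and each point is kept independently with probability inf p / p(x), which is assumed to match the acceptance probability in the proof. The retained points are then uniformly distributed, and roughly a fraction inf p of the points survives.

```python
import numpy as np

def thin_to_uniform(n, seed=0):
    """Draw n points from p(x) = 0.5 + x on [0, 1] and thin them to uniform."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 1.0, n)
    x = (-1.0 + np.sqrt(1.0 + 8.0 * u)) / 2.0    # inverse CDF of p
    density = 0.5 + x
    p_min = 0.5                                  # infimum of the density
    accept = rng.uniform(0.0, 1.0, n) < p_min / density
    return x, x[accept]

samples, thinned = thin_to_uniform(20000)
fractions = np.histogram(thinned, bins=4, range=(0.0, 1.0))[0] / len(thinned)
print(len(thinned) / len(samples), fractions)
```

Because the thinned set is a subset of the original samples, any guarantee proved for uniform samples of that (random) size carries over to the original set, at the cost of the constant factor given by the acceptance rate.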