Pseudo-Bayesian Optimization

Haoxian Chen,¹ Henry Lam¹

¹ Department of Industrial Engineering and Operations Research, Columbia University

Abstract

Bayesian Optimization is a popular approach for optimizing expensive black-box functions. Its key idea is to use a surrogate model to approximate the objective and, importantly, quantify the associated uncertainty that allows a sequential search of query points that balance exploitation-exploration. Gaussian process (GP) has been a primary candidate for the surrogate model, thanks to its Bayesian-principled uncertainty quantification power and modeling flexibility. However, its challenges have also spurred an array of alternatives whose convergence properties could be more opaque. Motivated by these, we study in this paper an axiomatic framework that elicits the minimal requirements to guarantee black-box optimization convergence that could apply beyond GP-based methods. Moreover, we leverage the design freedom in our framework, which we call Pseudo-Bayesian Optimization, to construct empirically superior algorithms. In particular, we show how using simple local regression, and a suitable “randomized prior” construction to quantify uncertainty, not only guarantees convergence but also consistently outperforms state-of-the-art benchmarks in examples ranging from high-dimensional synthetic experiments to realistic hyperparameter tuning and robotic applications.

1 Introduction

Bayesian Optimization (BO) is a popular method for the global optimization of black-box functions that are typically multimodal and expensive to evaluate. Its main idea is to model the objective landscape using observed data in order to “guess” where the global optimum is, rather than following a solution trajectory. One of the earliest works in BO is [31] in 1998, initially proposed for automotive and semiconductor designs and motivated by long runtimes of their computer codes. Nowadays, BO has been widely applied in machine learning including hyper-parameter tuning and Auto-ML [38], reinforcement learning [12], robotics [42], experimental design such as A/B testing [13], simulator calibration [56, 3], and engineering applications such as environmental monitoring [41] and aerospace system design [27]. Extensive overviews are in [10, 58, 22].

More precisely, a typical BO procedure sequentially searches for next design points to evaluate by updating a surrogate model to predict the objective and, moreover, quantify the associated uncertainty. Both the prediction and the uncertainty quantification are important as they work together, via a so-called acquisition function, to balance exploitation-exploration in the evaluation sequence. In this regard, Gaussian process (GP) has been a primary candidate for the surrogate model, thanks to its Bayesian-principled uncertainty quantification power and modeling flexibility [30, 49]. However, its computational cost scales cubically in the number of observations due to the inversion of the Gram kernel matrix in computing the posterior and, despite its principled design, it is conceivable to construct algorithms that could leverage problem structures more efficiently. To this end, there has been a surge of active studies to improve BO. These include the sparsification of GP using pseudo-inputs [55, 35, 59, 43, 28, 17] or kernel approximation [36] to reduce its time complexity, substitution of GP with other surrogates such as random forests [29] and neural networks [62, 60, 51, 71], and most recently the direct modeling of acquisition functions over the search space instead of through a surrogate [7, 65, 61].

Most of the state-of-the-art procedures mentioned above (and expanded further in Section 2) are practically motivated. In other words, they are shown to be superior via empirical comparisons, and designed via nicely intuited but ad hoc ideas. At the same time, the theory of BO has been largely confined to GP-based procedures. Among them, algorithmic consistency, i.e., convergence to the true optimum as the algorithm evolves indefinitely, can be attained with broader assumptions and algorithmic structures (but still GP-based; e.g., [40, 68, 21, 6]). Convergence rate results, while giving stronger conclusions, typically require more opaque assumptions, such as those associated with reproducing kernel Hilbert space (RKHS) and probability tail decays of the gradients of GPs, as well as more algorithmic restrictions such as upper confidence bound (UCB) instead of general acquisition functions [63, 32, 15]. In either case, there appears to be a mismatch between theory and practice: on one hand, practically superior algorithms do not have guarantees; on the other hand, BO theory is largely confined to GP and is not broad enough to cover more practical algorithms.

Given the above, our first high-level goal of this paper is to build a theory for exploration-based optimization that encompasses general algorithms beyond GP. This takes a step toward reducing the gap between theory and practice in the realm of BO. To clarify our scope, by theory here we mean algorithmic consistency. While this is weaker than rate results, algorithmic consistency is still largely open for procedures beyond GP, and moreover avoids the opaque assumptions and restrictive settings encountered in rate analysis. With this, our second high-level goal is to leverage our theory to locate strongly performing algorithms. Here, by strongly performing we mean algorithms that are empirically competitive against state-of-the-art benchmarks, while simultaneously having theoretical guarantees.

With these goals in mind, our main contribution is an axiomatic framework that elicits the minimal requirements on any exploration-based algorithm to guarantee convergence. Specifically, we dissect an exploration-based algorithm into three independent basic ingredients: surrogate predictor (SP), uncertainty quantifier (UQ), and acquisition function (AF). SP aims to provide point predictions at different design points. UQ quantifies the uncertainty of SP and indicates how reliable the current prediction is. AF transforms SP and UQ into the decision on which design point to evaluate next. We derive the axiomatic properties of SP, UQ and AF to attain theoretical convergence. In a nutshell, we call these properties local consistency, the sequential no-empty-ball property, and the improvement property respectively. These basic ingredients, along with their axiomatic properties, appear in GP-based algorithms. However, and as our key message, GP is not the only approach that exhibits these properties; instead, there are many more algorithms that could lead to similar convergence guarantees. We call our above framework Pseudo-Bayesian Optimization (PseudoBO). This is because, intuitively, we hinge on the Bayesian insight of BO in dissecting convergent procedures, but at the same time, a Bayesian perspective is generally not required for optimization convergence. Essentially, our framework extracts the minimally needed features in BO from the view of exploration-based optimization.

Our PseudoBO framework creates a recipe, with a list of SP, UQ and AF candidates that can be shown to satisfy the respective properties and hence altogether lead to algorithmic convergence. Our next endeavor is to assemble candidates in this recipe that perform competitively against existing benchmarks in empirical experiments. This resulting algorithm would then be one that exhibits both empirical and theoretical advantages over the benchmarks. In particular, we show how combining a simple local regression as SP, a suitably constructed “randomized prior” as UQ, and expected improvement (EI) as AF, can consistently outperform some state-of-the-art BO benchmarks across examples ranging from high-dimensional synthetic experiments to realistic hyperparameter tuning and robotic applications.

The remainder of this paper is as follows. Section 2 reviews related works on both practical and theoretical fronts. Section 3 presents the general theory of PseudoBO. Section 4 compiles the recipe that comprises a range of exemplifying ingredients under the PseudoBO framework. Section 5 discusses some details in turning our theory into implementation. Section 6 presents our empirical performances and comparisons with benchmark approaches. Section 7 concludes our paper and discusses future directions. All proofs and additional numerical details are presented in the Appendix.

2 Related Works

We review the existing literature. We categorize it roughly into two parts, one focusing on practical algorithms and implementations (Section 2.1) and one on theoretical guarantees (Section 2.2).

2.1 Literature on BO Practical Enhancements

We overview some existing approaches that aim to increase the scalability or efficiency of BO:

Sparse GP. The first approach comprises sparse pseudo-input GP [55, 35, 59, 43, 28, 17]. The idea is to select $m$ inducing pseudo-inputs by minimizing information loss, with $m$ smaller than the number of collected data points $n$, to reduce the rank of the involved covariance matrix and subsequently the computational complexity. The second approach is GP with sparse spectrum [36]. By Bochner’s theorem [9], the kernel can be approximated by random features and thus GP can then be approximated by Bayesian linear regression with finite basis functions of size $m$. In terms of complexity, both approaches reduce the time complexity from $\mathcal{O}(n^{3})$ to $\mathcal{O}(nm^{2}+m^{3})$ for the posterior distribution updating. A hybrid method that speeds up posterior sampling by combining the two approximations is given in [72]. However, the involved approximation and hence information loss may cause imprecise surrogate prediction or uncertainty quantification [58].
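To make the random-feature idea concrete, the following is a minimal sketch of the standard random Fourier feature construction for a squared-exponential kernel. It is not taken from [36]; the function names, the lengthscale, and the feature count $m=5000$ are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(x, y, lengthscale=1.0):
    """Exact squared-exponential kernel k(x, y) = exp(-||x - y||^2 / (2 l^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * lengthscale**2)))

def make_random_features(dim, m, lengthscale=1.0, seed=0):
    """Return a feature map z: R^dim -> R^m with E[z(x) . z(y)] = k(x, y)."""
    rng = np.random.default_rng(seed)
    # Frequencies are sampled from the kernel's spectral density, which is
    # Gaussian with standard deviation 1/lengthscale for this kernel.
    omega = rng.normal(0.0, 1.0 / lengthscale, size=(m, dim))
    phase = rng.uniform(0.0, 2.0 * np.pi, size=m)

    def z(x):
        return np.sqrt(2.0 / m) * np.cos(omega @ np.asarray(x, dtype=float) + phase)

    return z

x = np.array([0.3, -0.1])
y = np.array([0.0, 0.4])
z = make_random_features(dim=2, m=5000)
exact = rbf_kernel(x, y)
approx = float(z(x) @ z(y))  # inner product of m-dimensional features
```

With features of size $m$, a Bayesian linear regression only needs to factorize an $m\times m$ matrix, which is where the $\mathcal{O}(nm^{2}+m^{3})$ cost quoted above comes from.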

Surrogate Substitution. Rather than working on scaling GP, there has been a surge of research on replacing GP with other surrogate models. SMAC [29] leverages a random forest regressor and the standard deviation among trees to quantify uncertainty. However, the uncertainty quantification approach in this method, which utilizes bootstrapping, is unstable and can be overly small for unexplored areas. An alternative is to use neural networks for surrogate modeling. DNGO [60] adopts Bayesian linear regression on top of the representations learnt by a neural network. Based on this, [51] proposes ABLR, which improves the two-step learning in [60] to joint Bayesian learning but with additional computational complexity. BOHAMIANN [62] uses a modified Hamiltonian Monte Carlo to improve robustness and scalability of the surrogate model. [71] uses a neural network ensemble. However, these approaches require retraining the neural network after new data is collected, which is computationally expensive. Moreover, the performance appears sensitive to the network architecture [60].

Density Ratio Estimation-based Methods. Another line of research focuses on directly modeling the acquisition function [7, 65, 61]. [7] proposes the tree-structured Parzen estimator (TPE) by establishing the equivalence between expected improvement (EI) and the relative ratio between two densities, specifically for handling discrete and tree-structured inputs. Recently, BORE [65] was proposed to estimate this density ratio directly with likelihood-free inference, and LFBO [61] generalizes this framework to model any acquisition function in the form of expected utility.

Prior-data Fitted Networks (PFNs). PFNs, initially proposed in [45], leverage in-context learning and Transformer architectures to approximate Bayesian inference. While PFNs significantly speed up the posterior inference, they require substantial pretraining. This pretraining procedure involves repeatedly drawing data points from the prior distribution to effectively learn the probabilistic predictions needed for Bayesian inference. Applications of PFNs to BO include [44] and [52].

Other Approaches. There are a variety of other works to speed up the computation of BO. One way is to leverage a batch acquisition function, which proposes a batch of points to be evaluated at once. Ensemble Bayesian Optimization [70] employs an ensemble of additive GPs as well as a batch acquisition function to scale BO to tens of thousands of observations. Other batch acquisition functions are proposed in [14, 69, 57, 73, 25]. Another line of work focuses on dimension reduction, such as BOCK [47] and HeSBO [46]. Recently, TuRBO [20] was proposed to combine BO with the trust region method and batch acquisition through Thompson sampling [64]. In terms of implementation, GPyTorch [23] scales GP computation to thousands of queries, with conjugate gradients to solve linear systems and the Lanczos process to approximate the log-determinant. By leveraging hardware development, BoTorch [4] was developed as a state-of-the-art implementation of BO, speeding up the computation of acquisition functions with Monte Carlo sampling, sample average approximation (SAA) and computational technologies like auto-differentiation and parallel computation on CPUs and GPUs.

As explained in our introduction, the above approaches are practically oriented, in the sense that they are empirically attractive but can lack theoretical guarantees even on basic consistency. A contribution from our PseudoBO framework is our investigation of simple and cheap algorithms that can perform competitively against these benchmarks, while at the same time exhibiting a consistency guarantee. Furthermore, we also note that PseudoBO is not meant to be separate from the above works, in the sense that these developed tools can be combined with PseudoBO to offer better performance or accelerate the computation even more. For example, our experiments (in Section 6) include the combination with Trust Region to achieve higher scalability like in TuRBO, and the inner optimization of AF via Sobol sequence.

2.2 Literature on BO Theory

We divide the study of BO algorithms with theoretical guarantees into the investigation of consistency and more elaborate convergence rate analysis.

Consistency. [40] analyzes the consistency of BO with one-dimensional GP (in particular, Brownian motion) with EI. This was expanded upon in [68], which extends the consistency of EI to multi-dimensional settings, assuming the GP is stationary with the inverse of its spectral density exhibiting at most polynomial growth. Later, [6] generalizes the consistency to acquisitions of both EI and knowledge gradient (KG) for any Gaussian process with continuous sample paths. Besides, consistency results have also been established in various non-standard settings. For instance, [66] considers the objective as the sum or integral of multiple black-box functions. [2] considers objectives with inner network structures, where each node represents a black-box function with deterministic output. More generally, [1] studies composite objectives where a black-box function is composed with a function that has an explicitly known form. In the realm of simulation optimization, BO is applied to optimize expectation-form objective functions, and the simulator can generate aleatory uncertainties via possibly common random numbers, e.g., [50, 74]. Additionally, [67] considers black-box functions with input parameters that are unknown but observed via external data (i.e., the so-called input uncertainty in the stochastic simulation literature, e.g., [5, 16]). They propose Bayesian Information Collection and Optimization (BICO), which balances the trade-off between simulation and real data collection.

Convergence Rate Analysis. This is cast commonly in the form of regret, i.e., the difference between the attained objective value and the oracle best objective value attained by the unknown ground-truth solution. [63] derives the first sub-linear cumulative regret in the noisy setting for UCB acquisition. [11] analyzes the regret for the EGO algorithm in the deterministic setting. Recently, [39] establishes a regret bound for UCB when the objective belongs to a parametric family of functions, and [37] gives a concentration bound especially when using $\epsilon$-greedy procedures in radial basis function interpolations. Finally, we mention that there is a large literature on multi-armed bandit problems, including for instance [33, 53] that develop Bayesian regret bounds for so-called Thompson sampling. However, the discrete or linear structures of these problems appear different from continuous-space black-box optimization. In particular, their uncertainty typically arises from data noise, while in our setting there can be zero noise from data and the uncertainty comes solely from the lack of knowledge about the unexplored portion of the objective function.

Except for [39] and [37], all works above on analyzing BO focus on GP-based procedures. In contrast, PseudoBO derives an axiomatic dissection on consistent exploration-based algorithms beyond GP. In this way, we expand the scope of algorithms above that satisfy consistency. On the other hand, we do not offer rate results as in the second line of works above. However, we note that the theory on consistency already appears largely open beyond GP, and moreover avoids the opaque assumptions typically needed for sharper rate analyses.

3 Theory of Pseudo-Bayesian Optimization

Suppose we are interested in solving the optimization problem $\max_{x\in\mathcal{X}}f(x)$ , where the objective function $f:\mathcal{X}\to\mathbb{R}$ is unknown and $\mathcal{X}\subset\mathbb{R}^{d}$ is the decision space. The observations are deterministic. To introduce PseudoBO, we start with a simple but general algorithmic framework as follows. We optimize $f$ by using sequential function evaluations, where selecting which point to evaluate next is guided by some evaluation worthiness (EW) measure, say $W_{n}(x;\mathcal{D}_{n})$ . Here, $n$ is the step index in the procedure, and $\mathcal{D}_{n}=\{(x_{1},f(x_{1})),\ldots,(x_{n},f(x_{n}))\}$ denotes the collected data up to the $n$ -th step. For convenience, we also denote $X_{n}=\{x_{1},\ldots,x_{n}\}$ . In PseudoBO, at each step $n$ we solve $\max_{x\in\mathcal{X}}W_{n}(x;\mathcal{D}_{n})$ and the solution $x_{n+1}$ is set as the next point to evaluate. Thus, in summary, the PseudoBO algorithm is:

For each iteration $n=1,2,...,T$ :

Evaluate $f(x_{n})$ ;

Update $W_{n}(x;\mathcal{D}_{n})$ ;

Set $x_{n+1}\in\text{argmax}_{x\in\mathcal{X}}W_{n}(x;\mathcal{D}_{n})$ .

The estimated optimal solution at any step $n$ is $\hat{x}_{n}^{*}\in\text{argmax}_{x\in X_{n}}f(x)$ which gives the maximum evaluated function value so far.
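The loop above can be sketched in a few lines. This is a hedged illustration rather than the authors' implementation: the decision space is replaced by a finite candidate grid, the function and argument names are hypothetical, and the EW used in the demo (distance to the nearest evaluated point) is only a placeholder.

```python
import numpy as np

def pseudo_bo(f, candidates, worthiness, x_init, num_steps):
    """Generic loop: evaluate f(x_n), update the data, maximize the EW.

    candidates : (N, d) array standing in for the decision space (assumption).
    worthiness : callable W(x, X_eval, y_eval, n) -> float, the EW measure.
    """
    X_eval = [np.asarray(x_init, dtype=float)]
    y_eval = []
    for n in range(1, num_steps + 1):
        # Evaluate f at the current point x_n.
        y_eval.append(float(f(X_eval[-1])))
        X_arr, y_arr = np.array(X_eval), np.array(y_eval)
        # Score each candidate with W_n(.; D_n); a maximizer becomes x_{n+1}.
        scores = [worthiness(x, X_arr, y_arr, n) for x in candidates]
        X_eval.append(candidates[int(np.argmax(scores))])
    best = int(np.argmax(y_eval))  # best evaluated point so far
    return X_eval[best], y_eval[best]

# Demo on a 1-D grid with a toy objective maximized at x = 0.7.
grid = np.linspace(0.0, 1.0, 101).reshape(-1, 1)
objective = lambda x: -(x[0] - 0.7) ** 2
nearest_gap = lambda x, X_eval, y_eval, n: float(
    np.min(np.linalg.norm(X_eval - x, axis=1))
)
best_x, best_y = pseudo_bo(objective, grid, nearest_gap, x_init=grid[0], num_steps=30)
```

Passing the EW as a callable keeps the loop agnostic to how the worthiness is built, which mirrors the generality of the framework.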

3.1 Basic Algorithmic Consistency

Our first result characterizes the algorithmic consistency of PseudoBO. We denote $Z^{*}=\max_{x\in\mathcal{X}}f(x)$ as the optimal value of the target problem, $\Delta(x,S)=\min_{y\in S}\|x-y\|$ as the set distance from $x\in\mathcal{X}$ to the set $S\subset\mathcal{X}$ , and $E_{f}(S)=\{(x,f(x)):x\in S\}$ as the set of evaluated pairs $(x,f(x))$ for all $x\in S$ .

Assumption 3.1 (Sequential no-empty-ball property).

$W_{n}(\cdot\ ;\ \cdot)$ satisfies the following:

1. For any $x\in\mathcal{X}$ and finite-cardinality set sequence $S_{n}\subset\mathcal{X}$, if $\inf_{n}\Delta(x,S_{n})>0$, then $\liminf_{n\to\infty}W_{n}(x;A_{n})>0$ where $A_{n}=E_{f}(S_{n})$.

2. For any convergent sequence $x_{n}\in\mathcal{X}$, i.e., $x_{n}\to x^{\prime}$ for some $x^{\prime}\in\mathcal{X}$, we have $W_{n}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})\to 0$, where $\mathcal{D}_{n-1}=\{(x_{1},f(x_{1})),\ldots,(x_{n-1},f(x_{n-1}))\}$ and $A_{n}=E_{f}(S_{n})$ for any finite-cardinality set sequence $S_{n}\subset\mathcal{X}$.

Roughly speaking, part 1 of Assumption 3.1 stipulates that as long as there is no infinitesimally close evaluated point in the neighborhood of $x$, the EW of $x$ is positive. Part 2 plays a converse role to state that if $x$ entails an approaching sequence, then the EW of $x$ would shrink to 0, and this is true with or without any additional data represented by the set sequence $\{A_{n}\}$. The finite cardinality condition on the considered set sequences is imposed since we focus on realistic algorithms that can evaluate only a finite number of points, even though conceptually the assumption can be relaxed to include any set sequences (but in this case the EW for specific examples needs to be properly defined). Note that Assumption 3.1 is purely about the function $W_{n}$ and its interaction with $f$, and does not assume anything about the optimization procedure. We call Assumption 3.1 the sequential no-empty-ball (SNEB) property, where the no-empty-ball (NEB) notion follows from [68] and indicates that a zero value of $W_{n}$ at a point $x$ means any ball surrounding $x$ must contain some evaluated points in the past data, and vice versa. It is “sequential” because part 2 of the assumption modifies the original NEB property to consider the value of $W_{n}$ at the sequence $x_{n}$ instead of its limit $x^{\prime}$ as in [68]. Moreover, we also allow $W_{n}$ to be step-dependent, i.e., depend on step $n$. The former modification facilitates the argument of our basic guarantee, while the latter is useful to apply to some important examples as we will see in the sequel.
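As a simple illustration, and not an example taken from the paper, the distance from $x$ to its nearest evaluated point is one step-independent EW for which both parts can be checked by hand: it is bounded below by $\inf_{n}\Delta(x,S_{n})$, and along a convergent sequence it is at most $\|x_{n}-x_{n-1}\|\to 0$. The sketch below checks both parts numerically on hypothetical one-dimensional sequences.

```python
import numpy as np

def distance_ew(x, evaluated):
    """W(x; D): distance from x to the nearest evaluated point.

    evaluated is a (k, d) array of evaluated locations.
    """
    return float(np.min(np.linalg.norm(evaluated - np.asarray(x), axis=1)))

# Part 1: the sets S_n = n points in [0, 0.4] never come closer than 0.1
# to x = 0.5, so the EW at x stays bounded away from zero.
x = np.array([0.5])
part1 = [
    distance_ew(x, np.linspace(0.0, 0.4, n).reshape(-1, 1)) for n in range(1, 200)
]

# Part 2: along the convergent sequence x_n = 0.5 + 1/n, the EW at x_n given
# the earlier iterates x_1, ..., x_{n-1} shrinks to zero.
seq = np.array([[0.5 + 1.0 / n] for n in range(1, 201)])
part2 = [distance_ew(seq[n], seq[:n]) for n in range(1, len(seq))]
```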

We introduce our first theoretical result: As long as the EW $W$ satisfies the SNEB property, PseudoBO asymptotically reaches the true optimal value, or in other words it is algorithmically consistent.

Theorem 3.2 (Algorithmic consistency of PseudoBO).

Suppose EW $W_{n}$ satisfies Assumption 3.1 and $\mathcal{X}$ is compact. Then:

• $\mathcal{X}$ is eventually populated by the PseudoBO iterates, i.e., for any $x\in\mathcal{X}$, we have $\inf_{n}\Delta(x,X_{n})=0$ for $X_{n}=\{x_{1},\ldots,x_{n}\}$ where $x_{n}$ is the $n$-th output iterate in the PseudoBO algorithm.

• Consequently, if furthermore $f$ is continuous, then PseudoBO is algorithmically consistent, i.e., the estimated optimal solution $\hat{x}_{n}^{*}\in\text{argmax}_{x\in X_{n}}f(x)$ satisfies $f(\hat{x}_{n}^{*})\to Z^{*}$ as $n\to\infty$.

Theorem 3.2 reveals the exploration-based nature of PseudoBO: It achieves convergence by populating the search space. More precisely, the first part of the theorem concludes eventual popularization, which means that any $x\in\mathcal{X}$ has arbitrarily close evaluated points from the PseudoBO iterates eventually. By selecting the historically best evaluated point, this popularization then turns into asymptotic convergence to the optimal value, which is the second part of the theorem. Here, while EW can incorporate many sources of information, a requirement is that it must contain information about the local popularity to guide us in this space popularization. In reality, we would like to make guesses and evaluate at points that are likely close to the best (exploitation), but also be cautious about missing out on other potentially good regions (exploration). Accounting for this tradeoff requires a more specialized framework that contains ingredients to handle this issue more explicitly. We will describe these ingredients in the next subsection.

3.2 A More Specialized Framework

We consider a more specialized version of PseudoBO that materializes EW via three ingredients: surrogate predictor (SP), uncertainty quantifier (UQ) and acquisition function (AF). As discussed earlier, these ingredients appear in GP-based algorithms in BO but could be designed substantially more generally. Each of these ingredients needs to satisfy its own basic, independent, assumption, which we call local consistency, SNEB (introduced before) and the improvement property respectively.

Assumption 3.3 (Local consistency of SP).

The SP $\hat{f}(\cdot;\cdot):\mathcal{X}\times E_{f}(\mathcal{X})\to\mathbb{R}$ satisfies that for any convergent sequence $\{x_{n}\}\subset\mathcal{X}$, i.e., $x_{n}\to x^{\prime}$ for some $x^{\prime}\in\mathcal{X}$, we have $\hat{f}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})\to f(x^{\prime})$, where $\mathcal{D}_{n-1}=\{(x_{1},f(x_{1})),\ldots,(x_{n-1},f(x_{n-1}))\}$ and $A_{n}=E_{f}(S_{n})$ for any set sequence $S_{n}\subset\mathcal{X}$.

In Assumption 3.3, $\hat{f}(x;\mathcal{D})$ represents the predictor at $x$ using data $\mathcal{D}$ . This assumption stipulates that the true function value at a target point can be approximated with increasing precision by $\hat{f}$ constructed at evaluation points converging to this target, with the historically evaluated points and any additional data.

Assumption 3.4 (SNEB property of UQ).

The UQ $\hat{\sigma}(\cdot;\cdot):\mathcal{X}\times E_{f}(\mathcal{X})\to\mathbb{R}$ satisfies:

1. For any $x\in\mathcal{X}$ and finite-cardinality set sequence $S_{n}\subset\mathcal{X}$, if $\inf_{n}\Delta(x,S_{n})>0$, then $\liminf_{n\to\infty}\hat{\sigma}(x;A_{n})>0$ where $A_{n}=E_{f}(S_{n})$.

2. For any convergent sequence $x_{n}\in\mathcal{X}$, i.e., $x_{n}\to x^{\prime}$ for some $x^{\prime}\in\mathcal{X}$, we have $\hat{\sigma}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})\to 0$, where $\mathcal{D}_{n-1}=\{(x_{1},f(x_{1})),\ldots,(x_{n-1},f(x_{n-1}))\}$ and $A_{n}=E_{f}(S_{n})$ for any finite-cardinality set sequence $S_{n}\subset\mathcal{X}$.

Note that the SNEB property in Assumption 3.4 on the UQ is exactly the same as Assumption 3.1. This highlights the key role of UQ as the driver of exploration and ultimately solution convergence via the EW framework. However, by incorporating the SP via the AF (discussed momentarily), we can induce exploitation to enhance practical performances.

Assumption 3.5 (Improvement property of AF).

The AF $g_{n}(\cdot,\cdot):\mathbb{R}\times\mathbb{R}_{+}\to\mathbb{R}$ satisfies the following (where $p_{n}$ and $q_{n}$ are any real sequences):

1. $\liminf_{n\to\infty}g_{n}(p_{n},q_{n})>0$ if $\liminf_{n\to\infty}p_{n}>-\infty$ and $\liminf_{n\to\infty}q_{n}>0$.

2. $g_{n}(p_{n},q_{n})\to 0$ if $\limsup_{n\to\infty}p_{n}\leq 0$ and $q_{n}\to 0$.

AF can be viewed as a channel to convert SP and UQ into EW. That is, the higher the output of $g_{n}$ is, the more worthy the considered point is to evaluate. In Assumption 3.5, $p_{n}$ is the argument for the potential improvement regarding point estimation, and $q_{n}$ is the argument for the uncertainty. Part 2 of the assumption states that if there is, with eventual certainty, no improvement, then the worthiness to evaluate becomes zero. In contrast, part 1 stipulates that, as long as there is uncertainty, then there is some worthiness to evaluate the considered point (note that the condition $\liminf_{n\to\infty}p_{n}>-\infty$ there is largely a technicality that avoids $p_{n}$ being unboundedly negative). Finally, we allow AF to be step-dependent, which is intended to make our PseudoBO framework general enough to cover common existing algorithms.
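For intuition, the sketch below evaluates the familiar Gaussian closed form of expected improvement, $g(p,q)=\mathbb{E}[\max(p+qZ,0)]=p\,\Phi(p/q)+q\,\phi(p/q)$ with $Z$ standard normal, along two hypothetical sequences that match the two parts of Assumption 3.5. The parametrization and the sequences are assumptions made for illustration; the paper's own treatment of EI may differ in details.

```python
import math

def expected_improvement(p, q):
    """g(p, q) = E[max(p + q * Z, 0)] for a standard normal Z and q >= 0."""
    if q <= 0.0:
        return max(float(p), 0.0)  # no uncertainty: only the sure improvement
    u = p / q
    pdf = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * math.erfc(-u / math.sqrt(2.0))  # erfc keeps small tails accurate
    return p * cdf + q * pdf

# Part 1: p_n bounded below and q_n bounded away from zero keep g positive.
part1 = [expected_improvement(-1.0, 0.5 + 1.0 / n) for n in range(1, 1001)]

# Part 2: limsup p_n <= 0 and q_n -> 0 drive g to zero.
part2 = [expected_improvement(1.0 / n, 1.0 / n) for n in range(1, 1001)]
```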

We are now ready to put together all the above ingredients into algorithmic consistency. First, for a set $\mathcal{D}\subset E_{f}(\mathcal{X})$ , denote $\Pi_{f}(\mathcal{D})=\{y:(x,y)\in\mathcal{D}\text{\ for some\ }x\in\mathcal{X}\}$ as the projection of $\mathcal{D}$ onto the output dimension.

Theorem 3.6 (From SP+UQ+AF to EW).

Suppose SP $\hat{f}$, UQ $\hat{\sigma}$, and AF $g_{n}$ satisfy Assumptions 3.3, 3.4 and 3.5 respectively. Suppose also that $f$ is continuous. Then the EW constructed by $W_{n}(x;\mathcal{D}_{n})=g_{n}(\zeta(\hat{f}(x;\mathcal{D}_{n})-\max\Pi_{f}(\mathcal{D}_{n})),\hat{\sigma}(x;\mathcal{D}_{n}))$, where $\zeta(\cdot)$ is continuous and non-decreasing and $\zeta(0)\leq 0$, satisfies Assumption 3.1.

Based on Theorems 3.2 and 3.6, we obtain the following guarantee:

Corollary 3.7 (Algorithmic consistency via SP+UQ+AF).

Under the same assumptions as Theorem 3.6, for any compact $\mathcal{X}$, PseudoBO with EW constructed by $W_{n}(x;\mathcal{D}_{n})=g_{n}(\zeta(\hat{f}(x;\mathcal{D}_{n})-\max\Pi_{f}(\mathcal{D}_{n})),\hat{\sigma}(x;\mathcal{D}_{n}))$ is algorithmically consistent.
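The composition $W_{n}=g_{n}(\zeta(\hat{f}-\max\Pi_{f}(\mathcal{D}_{n})),\hat{\sigma})$ can be sketched as follows. The ingredients here are deliberately simple stand-ins chosen for illustration (a nearest-neighbor SP, a nearest-distance UQ, a Gaussian-form EI as AF, and $\zeta$ equal to the identity); they are assumptions of this sketch and not the local regression and randomized prior developed later in the paper.

```python
import math
import numpy as np

def nearest_neighbor_sp(x, X_eval, y_eval):
    """Stand-in SP: predict the value observed at the closest evaluated point."""
    idx = int(np.argmin(np.linalg.norm(X_eval - x, axis=1)))
    return float(y_eval[idx])

def distance_uq(x, X_eval):
    """Stand-in UQ: distance to the closest evaluated point."""
    return float(np.min(np.linalg.norm(X_eval - x, axis=1)))

def ei_af(p, q):
    """Stand-in AF: E[max(p + q * Z, 0)] with Z standard normal."""
    if q <= 0.0:
        return max(float(p), 0.0)
    u = p / q
    pdf = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * math.erfc(-u / math.sqrt(2.0))
    return p * cdf + q * pdf

def worthiness(x, X_eval, y_eval, zeta=lambda t: t):
    """EW built as g(zeta(SP - best observed value), UQ)."""
    p = zeta(nearest_neighbor_sp(x, X_eval, y_eval) - float(np.max(y_eval)))
    q = distance_uq(x, X_eval)
    return ei_af(p, q)

X_eval = np.array([[0.0], [1.0]])
y_eval = np.array([0.0, 1.0])
w_at_data = worthiness(np.array([1.0]), X_eval, y_eval)  # at an evaluated point
w_away = worthiness(np.array([0.4]), X_eval, y_eval)     # away from the data
```

In this toy setup the EW vanishes at an already evaluated point and stays positive away from the data, which is the behavior the SNEB property asks for.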

3.3 $(\delta,\epsilon)$-Relaxation of PseudoBO

The above PseudoBO framework guarantees eventual popularization and subsequently algorithmic consistency. In this subsection, we relax the popularization requirement to only a certain resolution, i.e., any point in $\mathcal{X}$ has eventually evaluated points within say a $\delta$-sized neighborhood. This in turn leads to algorithmic consistency up to an error relating to $\delta$. We study this relaxation for two reasons. First, in many problems, it makes sense to aim for near-optimality instead of exact optimality, because a small optimality gap can play a negligible practical role and may not be cost-effective to close. Second, and more importantly, we will see that some SPs are in fact fundamentally accurate only up to a certain level of error. For such SPs, their pairing UQs are naturally SNEB up to an associated relaxation level, which in turn induces the $\delta$-relaxed eventual popularization.

We first state a relaxation of the SNEB property.

Assumption 3.8 ($\delta$-relaxed SNEB property).

For a given $\delta\geq 0$ , $W_{n}(\cdot\ ;\ \cdot)$ satisfies the following:

1. For any $x\in\mathcal{X}$ and finite-cardinality set sequence $S_{n}\subset\mathcal{X}$, if $\inf_{n}\Delta(x,S_{n})>\delta$, then $\liminf_{n\to\infty}W_{n}(x;A_{n})>0$ where $A_{n}=E_{f}(S_{n})$.

2. For any convergent sequence $x_{n}\in\mathcal{X}$, i.e., $x_{n}\to x^{\prime}$ for some $x^{\prime}\in\mathcal{X}$, we have $W_{n}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})\to 0$, where $\mathcal{D}_{n-1}=\{(x_{1},f(x_{1})),\ldots,(x_{n-1},f(x_{n-1}))\}$ and $A_{n}=E_{f}(S_{n})$ for any finite-cardinality set sequence $S_{n}\subset\mathcal{X}$.

Compared to Assumption 3.1, Assumption 3.8 only requires that $\liminf_{n\to\infty}W_{n}(x;A_{n})>0$ if $\inf_{n}\Delta(x,S_{n})>\delta$ instead of $\inf_{n}\Delta(x,S_{n})>0$. That is, the point has positive EW if it is sufficiently far, i.e., more than $\delta$ units, away from any eventually evaluated points. All other parts of the assumption remain the same as before. Assumption 3.8 is weaker than Assumption 3.1 when $\delta>0$, and reduces back to Assumption 3.1 when $\delta=0$.
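A hedged toy example of an EW that is only $\delta$-relaxed SNEB is the truncated distance $\max(\Delta(x,X_{n})-\delta,0)$: it is positive only when every evaluated point is more than $\delta$ away. The sketch below, with a hypothetical grid and $\delta=0.1$, greedily maximizes this EW and stops once it vanishes everywhere; the stopping rule is an artifact of the demo, not part of the PseudoBO algorithm.

```python
import numpy as np

def relaxed_distance_ew(x, X_eval, delta):
    """Truncated distance EW: positive only beyond a delta-neighborhood of the data."""
    gap = float(np.min(np.linalg.norm(X_eval - x, axis=1)))
    return max(gap - delta, 0.0)

grid = np.linspace(0.0, 1.0, 101).reshape(-1, 1)  # stand-in for the decision space
delta = 0.1
X_eval = grid[:1].copy()  # start from the single point x_1 = 0

while True:
    scores = np.array([relaxed_distance_ew(x, X_eval, delta) for x in grid])
    if scores.max() == 0.0:
        break  # every grid point already has an evaluated point within delta
    X_eval = np.vstack([X_eval, grid[int(np.argmax(scores))]])

# Largest distance from any grid point to its nearest evaluated point.
fill = max(float(np.min(np.linalg.norm(X_eval - x, axis=1))) for x in grid)
```

The search stops after populating the grid only up to resolution $\delta$, using far fewer evaluations than grid points.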

With the above updated assumption, we obtain a corresponding relaxed version of algorithmic consistency. For this, we need to strengthen the continuity assumption of the objective function $f$ to Lipschitz continuity. We call $f$ $L$-Lipschitz if $|f(x)-f(x^{\prime})|\leq L\|x-x^{\prime}\|$ for any $x,x^{\prime}\in\mathcal{X}$ for some given constant $L>0$.

Theorem 3.9 ($\delta$-relaxed algorithmic consistency of PseudoBO).

Suppose EW $W_{n}$ satisfies Assumption 3.8 and $\mathcal{X}$ is compact. Then:

• $\mathcal{X}$ is eventually populated by the PseudoBO iterates up to $\delta$-neighborhoods, i.e., for any $x\in\mathcal{X}$, we have $\inf_{n}\Delta(x,X_{n})\leq\delta$ for $X_{n}=\{x_{1},\ldots,x_{n}\}$ where $x_{n}$ is the $n$-th output iterate in the PseudoBO algorithm.

• Consequently, if furthermore $f$ is $L$-Lipschitz continuous, then PseudoBO is algorithmically consistent up to $L\delta$ error, i.e., the estimated optimal solution $\hat{x}_{n}^{*}\in\text{argmax}_{x\in X_{n}}f(x)$ satisfies $\liminf_{n}f(\hat{x}_{n}^{*})\geq Z^{*}-L\delta$.

Like Theorem 3.2, part 1 of Theorem 3.9 concludes that the decision space $\mathcal{X}$ is eventually populated, but now up to $\delta$ -sized neighborhoods. Part 2 then translates this popularization to algorithmic consistency, now with an error that depends on $\delta$ and the Lipschitz constant $L$ .
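One way to see how the second part follows from the first is sketched here; the complete proof is in the Appendix. Let $x^{*}$ be a maximizer of $f$ over the compact set $\mathcal{X}$ and fix any $\eta>0$. By the first part, there exist an index $N$ and an iterate $x_{k}\in X_{N}$ with $\|x^{*}-x_{k}\|\leq\delta+\eta$. Since $\hat{x}_{n}^{*}$ is the best evaluated point, for every $n\geq N$,

$$f(\hat{x}_{n}^{*})\;\geq\;f(x_{k})\;\geq\;f(x^{*})-L\|x^{*}-x_{k}\|\;\geq\;Z^{*}-L(\delta+\eta).$$

Taking $\liminf_{n}$ and then letting $\eta\to 0$ gives $\liminf_{n}f(\hat{x}_{n}^{*})\geq Z^{*}-L\delta$.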

Next, like Section 3.2, we specialize our relaxed framework to the use of SP, UQ and AF. We first consider a relaxed version of local consistency of SP.

Assumption 3.10 ($\epsilon$-relaxed local consistency of SP).

For a given $\epsilon\geq 0$ , the SP $\hat{f}(\cdot;\cdot):\mathcal{X}\times E_{f}(\mathcal{X})\to\mathbb{R}$ satisfies that for any convergent sequence $x_{n}\in\mathcal{X}$ , i.e., $x_{n}\to x^{\prime}$ for some $x^{\prime}\in\mathcal{X}$ , we have

\limsup_{n}|\hat{f}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})-f(x^{\prime})|\leq\epsilon

where $\mathcal{D}_{n-1}=\{(x_{1},f(x_{1})),\ldots,(x_{n-1},f(x_{n-1}))\}$ and $A_{n}=E_{f}(S_{n})$ for any finite-cardinality set sequence $S_{n}\subset\mathcal{X}$.

Compared to Assumption 3.3, Assumption 3.10 stipulates that $\hat{f}$ is accurate only up to an $\epsilon$ error even when there is a sequence of evaluated points that gets infinitesimally close. That is, the predictor is fundamentally erroneous at level $\epsilon$ . Note that, when $\epsilon=0$ , Assumption 3.10 reduces back to Assumption 3.3.

Next, the following is the relaxed version of SNEB for the UQ $\hat{\sigma}$ , which is the same as that for EW in Assumption 3.8.

Assumption 3.11 ( $\delta$ -relaxed SNEB property of UQ).

For a given $\delta\geq 0$ , the UQ $\hat{\sigma}(\cdot;\cdot):\mathcal{X}\times E_{f}(\mathcal{X})\to\mathbb{R}$ satisfies:

1.

For any $x\in\mathcal{X}$ and finite-cardinality set sequence $S_{n}\subset\mathcal{X}$ , if $\inf_{n}\Delta(x,S_{n})>\delta$ , then $\liminf_{n\to\infty}\hat{\sigma}(x;A_{n})>0$ where $A_{n}=E_{f}(S_{n})$ .
2.

For any convergent sequence $x_{n}\in\mathcal{X}$ , i.e., $x_{n}\to x^{\prime}$ for some $x^{\prime}\in\mathcal{X}$ , we have $\hat{\sigma}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})\to 0$ , where $\mathcal{D}_{n-1}=\{(x_{1},f(x_{1})),\ldots,(x_{n-1},f(x_{n-1}))\}$ and $A_{n}=E_{f}(S_{n})$ for any finite-cardinality set sequence $S_{n}\subset\mathcal{X}$ .

Finally, the improvement property of AF remains the same under the relaxed framework as before, i.e., we would still use Assumption 3.5 for AF. We then have the following guarantee.

Theorem 3.12 (From SP+UQ+AF to EW under $(\epsilon,\delta)$ -relaxation).

Suppose SP $\hat{f}$, UQ $\hat{\sigma}$, and AF $g_{n}$ satisfy Assumptions 3.10, 3.11 and 3.5 respectively. Suppose also that $f$ is continuous. Then the EW constructed by $W_{n}(x;\mathcal{D}_{n})=g_{n}(\zeta(\hat{f}(x;\mathcal{D}_{n})-\max\Pi_{f}(\mathcal{D}_{n})-\epsilon),\hat{\sigma}(x;\mathcal{D}_{n}))$, where $\zeta(\cdot)$ is continuous and non-decreasing and $\zeta(0)\leq 0$, satisfies Assumption 3.8.

Compared to Theorem 3.6, in Theorem 3.12 we use the relaxed versions of the local consistency and SNEB properties. Moreover, we use $\zeta(\hat{f}(x;\mathcal{D}_{n})-\max\Pi_{f}(\mathcal{D}_{n})-\epsilon)$ instead of $\zeta(\hat{f}(x;\mathcal{D}_{n})-\max\Pi_{f}(\mathcal{D}_{n}))$ in the first argument of $g_{n}(\cdot,\cdot)$ when defining the EW. Recall that the first argument of $g_{n}(\cdot,\cdot)$ represents the potential improvement regarding point estimation. When the SP has a fundamental inaccuracy of $\epsilon$ units, we naturally discount the potential improvement by the same amount of prediction inaccuracy, which leads to the extra $\epsilon$-reduction in this improvement calculation.

Lastly, based on Theorems 3.9 and 3.12, we obtain the following guarantee:

Corollary 3.13 (Algorithmic consistency via SP+UQ+AF under $(\epsilon,\delta)$ -relaxation).

Under the same assumptions as Theorem 3.12, for any compact $\mathcal{X}$ and assuming additionally that $f$ is $L$-Lipschitz, PseudoBO with EW constructed by $W_{n}(x;\mathcal{D}_{n})=g_{n}(\zeta(\hat{f}(x;\mathcal{D}_{n})-\max\Pi_{f}(\mathcal{D}_{n})-\epsilon),\hat{\sigma}(x;\mathcal{D}_{n}))$ is algorithmically consistent up to $L\delta$ error.

Corollary 3.13 concludes that, when we define the potential improvement in a way that accounts for the $\epsilon$ prediction inaccuracy, and the UQ has the $\delta$-relaxed SNEB property, the resulting PseudoBO procedure is algorithmically consistent up to an $L\delta$ error. Note that when $\epsilon=\delta=0$, Corollary 3.13 reduces to the non-relaxed case in Corollary 3.7. In fact, the requirements on $\epsilon$ and $\delta$ are separate, i.e., the corollary holds even if only one of $\epsilon$ and $\delta$ is non-zero. That is to say, we can opt to aim for population up to $\delta$-neighborhoods, and consequently $L\delta$-optimality, when the SP has no fundamental error. Similarly, we might have an $\epsilon$-inaccurate SP, but choose a UQ that has the exact SNEB property instead of being $\delta$-relaxed. This is indeed possible; however, as we will see in the next section, some natural choices of UQs that couple with an $\epsilon$-relaxed locally consistent SP would only bear the $\delta$-relaxed SNEB property. This makes the relaxations of local consistency and of the SNEB property naturally coupled, and it is important to consider both.

We close this section by explaining why it is important to consider SPs that are $\epsilon$-relaxed locally consistent. This arises from the fact that many common machine learning predictors contain hyperparameters that typically need to be tuned in relation to the sample size. Such tuning would ensure the predictor is locally consistent, but only if the design points are sampled according to certain distributions. As we define our local consistency (Assumptions 3.3 and 3.10) in a way that is free of any distributional assumption on the $\{x_{n}\}$ sequence, standard tuning approaches that aim for consistency would not apply. Our $\epsilon$-relaxation serves to remedy this issue since, to achieve the relaxed local consistency, we can simply choose a fixed hyperparameter value instead of scaling it with the number of evaluated points.

4 The PseudoBO Cookbook

We present a range of examples for SP, UQ and AF to demonstrate the generality of PseudoBO and how it applies to existing algorithms as well as new ones. Importantly, it also paves the way for us to select practically superior algorithms. Our results can be summarized as a PseudoBO recipe in Figure 1.

Fig. 1: A general recipe for configuring a PseudoBO algorithm.

4.1 SP with Local Consistency

The essence of local consistency is that the SP can correctly estimate the objective value with sufficient data around, given continuity of the objective. An exemplifying type of locally consistent SP is the class of interpolating models, which give prediction values exactly as the evaluated values for all the evaluated points, and similar to the values of the surrounding evaluated points otherwise. GP posterior mean, nearest neighbor and over-parameterized neural networks are examples. For our discussion below, we recall the notation $X_{n}=\{x_{1},\ldots,x_{n}\}\subset\mathcal{X}$ as a finite set of points that we select to evaluate, and $\mathcal{D}_{n}=\{(x_{1},f(x_{1})),\ldots,(x_{n},f(x_{n}))\}$ as the collection of evaluated pairs.

We start with the GP posterior mean predictor. To define this, we first choose a positive semi-definite kernel function, $\mathcal{K}:\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}$, and a mean function, $\mu_{0}:\mathbb{R}^{d}\to\mathbb{R}$. The mean function is commonly a constant function $\mu_{0}(x)=0$ for centered GPs or a low-order polynomial [22]. Given data $\mathcal{D}_{n}$, the GP posterior mean predictor is defined as

\hat{f}_{\text{GP}}(x_{0};\mathcal{D}_{n})=\mathcal{K}(x_{0},X_{n})\mathcal{K}(X_{n},X_{n})^{-1}(f(X_{n})-\mu_{0}(X_{n}))+\mu_{0}(x_{0}),

(1)

where $\mathcal{K}(x_{0},X_{n}):=[\mathcal{K}(x_{0},x_{1}),\ldots,\mathcal{K}(x_{0},x_{n})]$, $\mathcal{K}(X_{n},X_{n}):=[\mathcal{K}(x_{i},x_{j})]_{1\leq i,j\leq n}$, $\mu_{0}(X_{n}):=[\mu_{0}(x_{1}),\ldots,\mu_{0}(x_{n})]$ and $f(X_{n}):=[f(x_{1}),\ldots,f(x_{n})]$, for a test point $x_{0}\in\mathcal{X}$. The kernel function $\mathcal{K}$ and mean function $\mu_{0}$ correspond to the prior covariance and mean of the GP, and with data $\mathcal{D}_{n}$ we output the posterior mean of the GP as the SP (1). Note that the GP posterior mean predictor is an interpolator. Specifically, for any observed data point $(x_{i},f(x_{i}))\in\mathcal{D}_{n}$, we have $\hat{f}_{\text{GP}}(x_{i};\mathcal{D}_{n})=f(x_{i})$. The intuition behind this is that $\mathcal{K}(x_{i},X_{n})$ is the $i$-th row of $\mathcal{K}(X_{n},X_{n})$, so that $\mathcal{K}(x_{i},X_{n})\mathcal{K}(X_{n},X_{n})^{-1}=e_{i}^{\intercal}$, leading to $\hat{f}_{\text{GP}}(x_{i};\mathcal{D}_{n})=e_{i}^{\intercal}(f(X_{n})-\mu_{0}(X_{n}))+\mu_{0}(x_{i})=f(x_{i})$.
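
The interpolation property can be checked numerically. The following is a minimal sketch of (1), assuming a Gaussian kernel with bandwidth $h=1$, a zero prior mean, and a toy objective $\sin(3x)$; the helper names (`gaussian_kernel`, `solve`, `gp_posterior_mean`) and the three design points are our own illustrative choices, and the small linear solver merely stands in for a proper numerical library.

```python
import math
from typing import Callable, List, Sequence

Point = Sequence[float]

def gaussian_kernel(x: Point, y: Point, h: float = 1.0) -> float:
    """Gaussian covariance between two points (assumed kernel choice)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * h * h))

def solve(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            t = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= t * M[c][k]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * z[k] for k in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z

def gp_posterior_mean(x0: Point, X: List[Point], y: List[float],
                      kernel: Callable[[Point, Point], float] = gaussian_kernel,
                      mu0: Callable[[Point], float] = lambda x: 0.0) -> float:
    """Evaluate (1): K(x0,X) K(X,X)^{-1} (f(X) - mu0(X)) + mu0(x0)."""
    K = [[kernel(a, b) for b in X] for a in X]
    resid = [yi - mu0(xi) for xi, yi in zip(X, y)]
    alpha = solve(K, resid)  # alpha = K(X,X)^{-1} (f(X) - mu0(X))
    return sum(kernel(x0, xi) * ai for xi, ai in zip(X, alpha)) + mu0(x0)

# Toy data: three evaluated points of f(x) = sin(3x).
X = [[0.0], [0.5], [1.5]]
y = [math.sin(3.0 * x[0]) for x in X]
fitted = [gp_posterior_mean(x, X, y) for x in X]  # should reproduce y up to rounding
```

On this toy data, `fitted` agrees with `y` up to floating-point error, which is the interpolation property described above.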

Proposition 4.1 (Local consistency of GP mean predictor).

Assume a GP’s covariance function $\mathcal{K}$ is continuous and positive definite. Also assume $f\in\mathcal{H}$ , the RKHS induced by $\mathcal{K}$ . Then the GP posterior mean is locally consistent.

Note that we do not make any probabilistic assumption regarding the black-box function in Proposition 4.1. This is the essence of our PseudoBO framework, that we only extract properties needed to guarantee algorithmic convergence instead of using probabilistic assumptions from the Bayesian framework.

Next, we consider the nearest neighbor predictor

\hat{f}_{\text{NN}}(x_{0};\mathcal{D}_{n})=f(\operatorname*{arg\,min}_{x\in X_{n}}\left\lVert x-x_{0}\right\rVert)

(2)

for a test point $x_{0}\in\mathcal{X}$ .
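
As an illustration, (2) can be sketched in a few lines. The function name and the tie-breaking rule (first index wins) are our own assumptions, since (2) does not specify how ties in the arg min are resolved.

```python
import math
from typing import List, Sequence

def nearest_neighbor_predict(x0: Sequence[float],
                             X: List[Sequence[float]],
                             y: List[float]) -> float:
    """Return f at the evaluated point closest to x0 (ties: first index)."""
    i = min(range(len(X)), key=lambda k: math.dist(X[k], x0))
    return y[i]

# Toy data in two dimensions.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
y = [1.0, 3.0, -2.0]
pred = nearest_neighbor_predict([0.9, 0.2], X, y)  # closest point is [1.0, 0.0]
```

By construction the predictor returns the evaluated value at every evaluated point, so it is an interpolator in the sense discussed above.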

Proposition 4.2 (Local consistency of nearest neighbor).

Assume $f$ is continuous. Then the nearest neighbor predictor is locally consistent.

Next, a neural network predictor at point $x_{0}$ is given by:

\hat{f}_{\text{Net}}(x_{0};\mathcal{D}_{n})=f_{\phi_{N},\mathbf{H}_{N}^{*}}\circ f_{\phi_{N-1},\mathbf{H}_{N-1}^{*}}\circ\cdots\circ f_{\phi_{1},\mathbf{H}_{1}^{*}}(x_{0}),

(3)

where $f_{\phi_{i},\mathbf{H}_{i}}(x^{\prime}):=\phi_{i}({\mathbf{H}_{i}}[x^{\prime},1])$, $\mathbf{H}_{1}\in\mathbb{R}^{d_{1}\times(d+1)}$, $\mathbf{H}_{i}\in\mathbb{R}^{d_{i}\times(d_{i-1}+1)}$ for $i\geq 2$, $N$ denotes the number of layers, $\phi_{i}$ denotes an activation function, and $(\mathbf{H}_{1}^{*},\ldots,\mathbf{H}_{N}^{*}):=\operatorname*{arg\,min}_{\mathbf{H}_{1},\ldots,\mathbf{H}_{N}}\frac{1}{n}\sum_{k=1}^{n}(f_{\phi_{N},\mathbf{H}_{N}}\circ f_{\phi_{N-1},\mathbf{H}_{N-1}}\circ\cdots\circ f_{\phi_{1},\mathbf{H}_{1}}(x_{k})-f(x_{k}))^{2}$. Moreover, by over-parametrization we mean that for any dataset $\mathcal{D}_{n}$, the hyperparameters, such as the depth of the network $N$ and the hidden dimensions $\{d_{i}\}_{1\leq i\leq N}$, can be adjusted so that $(\mathbf{H}_{1}^{*},\ldots,\mathbf{H}_{N}^{*})$ perfectly fits the data $\mathcal{D}_{n}$, i.e., $\hat{f}_{\text{Net}}(x;\mathcal{D}_{n})=f(x)$ for any $x\in X_{n}$.

Proposition 4.3 (Local consistency of over-parameterized neural network).

Assume a neural network predictor is $L$ -Lipschitz and over-parameterized. Then it is locally consistent.

In fact, a closer look at the proof of Proposition 4.3 reveals that its conclusion holds as long as the SP is $L$-Lipschitz and over-parameterized. That is, we do not need the neural network structure in our argument. Nonetheless, neural networks offer a natural class of SPs where the needed properties can be attained.

Another type of locally consistent SP is the family of models that, in some sense, average the values of the surrounding evaluated points, and the prediction values at any evaluated points are not necessarily the evaluated values. This includes, for instance, regression trees and local regression.

We start with the regression tree predictor $\hat{f}_{\text{Tree}}(x_{0};\mathcal{D}_{n})$ , where the tree structure is parameterized by some splitting rule $\theta$ . The tree predictor at point $x_{0}$ is given by:

\hat{f}_{\text{Tree}}(x_{0};\mathcal{D}_{n})=\sum_{i=1}^{n}w_{i}(x_{0};\theta)f(x_{i}),

(4)

where $w_{i}(x_{0};\theta)=\frac{\mathbbm{1}\{x_{i}\in R_{l(x_{0},\theta)}\}}{\sum_{j=1}^{n}\mathbbm{1}\{x_{j}\in R_{l(x_{0},\theta)}\}}$, and $R_{l(x_{0},\theta)}$ denotes the region covered by the leaf $l$, determined by both the input $x_{0}$ and $\theta$. If $\sum_{j=1}^{n}\mathbbm{1}\{x_{j}\in R_{l(x_{0},\theta)}\}=0$, then we define $\hat{f}_{\text{Tree}}(x_{0};\mathcal{D}_{n})=\frac{1}{n}\sum_{i=1}^{n}f(x_{i})$.

Proposition 4.4 ( $\epsilon$ -relaxed local consistency of regression tree).

Assuming $f$ is $L$ -Lipschitz, the regression tree predictor (4) parameterized by splitting rule $\theta$ is $2L\cdot\max_{x\in\mathcal{X}}\text{diam}(R_{l(x,\theta)})$ -relaxed locally consistent, where $\text{diam}(R)$ for a region $R$ refers to $\max_{x,y\in R}\|x-y\|.$

Proposition 4.4 stipulates that if we configure the regression tree such that all its leaves are small, then it is $\epsilon$ -relaxed locally consistent with an $\epsilon$ that scales with this small leaf size and the Lipschitz constant of $f$ .
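
A minimal sketch of (4) is given below. Since the splitting rule $\theta$ is left generic in the text, we assume, purely for illustration, a fixed uniform grid with cells of side `width`; in one dimension each leaf then has diameter `width`, so the bound in Proposition 4.4 would read $2L\cdot$`width` for this hypothetical choice. The names `leaf_index` and `tree_predict` are ours.

```python
import math
from typing import List, Sequence, Tuple

def leaf_index(x: Sequence[float], width: float) -> Tuple[int, ...]:
    """Hypothetical splitting rule theta: a uniform grid of cells of side `width`."""
    return tuple(math.floor(c / width) for c in x)

def tree_predict(x0: Sequence[float], X: List[Sequence[float]],
                 y: List[float], width: float = 0.25) -> float:
    """Evaluate (4): average f over evaluated points in the leaf containing x0."""
    leaf = leaf_index(x0, width)
    in_leaf = [yi for xi, yi in zip(X, y) if leaf_index(xi, width) == leaf]
    if not in_leaf:  # empty leaf: fall back to the global mean, as in the text
        return sum(y) / len(y)
    return sum(in_leaf) / len(in_leaf)

# Toy data on [0, 1] with leaves [0, 0.25), [0.25, 0.5), [0.5, 0.75), [0.75, 1).
X = [[0.05], [0.20], [0.60], [0.90]]
y = [1.0, 3.0, 5.0, 7.0]
same_leaf_avg = tree_predict([0.10], X, y)   # averages the two points in the first leaf
empty_leaf = tree_predict([0.30], X, y)      # no evaluated point in this leaf
```

Note that `same_leaf_avg` differs from both evaluated values in its leaf, which illustrates why this predictor is an averaging model and not an interpolator.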

Next we present the local kernel regression predictor:

\hat{f}_{\text{KR}}(x_{0};\mathcal{D}_{n})=\frac{\sum_{i\leq n}K\left(\frac{\|x_{0}-x_{i}\|}{h}\right)f(x_{i})}{\sum_{i\leq n}K\left(\frac{\|x_{0}-x_{i}\|}{h}\right)}

(5)

where $K$ is a non-negative function with bounded support on $[0,M]$ for some $M>0$, i.e., $K(y)=0$ for $y>M$ and $K(y)>0$ for $y\leq M$. The hyperparameter $h>0$ is the bandwidth. Definition (5) is well-defined as long as there is an $x_{i}$ for $i\leq n$ such that $\|x_{0}-x_{i}\|\leq Mh$. If the latter does not hold, we output a prefixed constant, e.g., 0, as the prediction value. Note that we can modify all the above slightly if the bounded support is on $[0,M)$, i.e., when $y=M$, $K(y)=0$ instead of being positive, and all the discussion can be easily adapted accordingly. Definition (5) mimics the Nadaraya-Watson estimator. As a simple example, we can take $K(y)$ to be a positive constant for $y\in[0,M]$.
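
The following sketch implements (5) with the constant (box) kernel mentioned above. The values $M=1$ and $h=0.2$, the fallback constant 0, and the function names are illustrative assumptions.

```python
import math
from typing import Callable, List, Sequence

def box_kernel(u: float, M: float = 1.0) -> float:
    """K(u) = 1 on [0, M] and 0 elsewhere (the constant-kernel example)."""
    return 1.0 if 0.0 <= u <= M else 0.0

def kernel_regression(x0: Sequence[float], X: List[Sequence[float]], y: List[float],
                      h: float = 0.2,
                      kernel: Callable[[float], float] = box_kernel,
                      default: float = 0.0) -> float:
    """Evaluate (5); return `default` when no evaluated point is within M*h of x0."""
    w = [kernel(math.dist(x0, xi) / h) for xi in X]
    total = sum(w)
    if total == 0.0:
        return default
    return sum(wi * yi for wi, yi in zip(w, y)) / total

# Toy data: two nearby points and one distant point.
X = [[0.0], [0.1], [1.0]]
y = [2.0, 4.0, 10.0]
local_avg = kernel_regression([0.05], X, y)  # averages the two points within h
no_neighbor = kernel_regression([0.5], X, y) # nothing within h, so the default is returned
```

With a fixed bandwidth, the prediction at `[0.05]` is an average over a window of radius $Mh$, which is where the $LMh$ error in the relaxed local consistency comes from.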

Proposition 4.5 ( $\epsilon$ -relaxed local consistency of local kernel regression).

Consider the local kernel regression predictor (5), where $K$ is a non-negative function with bounded support on $[0,M]$ for some $M>0$, and $h>0$ is the bandwidth. Assume $f$ is $L$-Lipschitz. Then the local kernel regression predictor is $LMh$-relaxed locally consistent.

In typical statistical contexts, the hyperparameters $\theta$ and $h$ in regression trees and local kernel regression scale with the sample size $n$ . However, here we take $\theta$ and $h$ to be fixed. As discussed before, unlike in conventional statistical contexts, we make no assumption on how $x_{i}$ are distributed; in fact, $x_{i}$ can be selected according to a completely deterministic process in the PseudoBO algorithm, and thus the standard way of tuning these hyperparameters does not apply. Instead, we fix them and aim for $\epsilon$ -relaxed local consistency, i.e., we pay a small price on the convergence due to fixing these hyperparameters, in exchange for our capability to bypass the tuning issue.

Local consistency can be preserved if we apply the locally consistent SP together with a “randomized prior”. This randomized prior means we first sample a function from a random field, then perturb the function values at the evaluated points by this randomly generated function and construct a predictor from the perturbed data. When we output the prediction values, we compensate the prediction from the perturbed data by adding this function back. This idea is more relevant when we discuss UQ, where the randomized prior serves as a natural device to quantify uncertainty. Here, we want to point out that an average over these perturbed predictions also gives rise to a valid SP.

More precisely, let $r:\mathcal{X}\to\mathbb{R}$ be a continuous function randomly generated from a generating distribution $\mathcal{R}$ . Suppose a “base” predictor is $\hat{f}$ . The randomized prior mean prediction is

\hat{f}_{\text{RP}}(x_{0};\mathcal{D}_{n})=\mathbb{E}_{r\sim\mathcal{R}}[\hat{f}^{(r)}(x_{0};\mathcal{D}_{n})],

(6)

where $\hat{f}^{(r)}(x_{0};\mathcal{D}_{n})=r(x_{0})+\hat{f}(x_{0};\mathcal{D}_{n}^{(r)})$ is a “perturb-then-compensate” predictor using the function $r$, and $\mathcal{D}_{n}^{(r)}:=\{(x_{i},f(x_{i})-r(x_{i}))\}_{i=1}^{n}$ is the $r$-perturbed data set.

Proposition 4.6 (Preservation of $\epsilon$ -relaxed local consistency of randomized prior mean).

Suppose $r$ is a continuous function randomly generated from the generating distribution $\mathcal{R}$ , and $\hat{f}^{(r)}$ is $\epsilon$ -relaxed locally consistent in predicting $f$ for almost surely any $r$ . Also assume $f,\hat{f}^{(r)}$ are uniformly bounded. Then the randomized prior mean (6) is $\epsilon$ -relaxed locally consistent.

Lastly, we have another useful preservation result for local consistency. Specifically, if we have a class of ( $\epsilon$ -relaxed) locally consistent SPs, then their convex combination, which we call a hybrid SP, will also enjoy ( $\epsilon$ -relaxed) local consistency.

Proposition 4.7 (Preservation of $\epsilon$ -relaxed local consistency of hybrid SP).

Given a finite set of $\epsilon$-relaxed locally consistent SPs $\hat{\mathcal{F}}:=\{\hat{f}_{1},\hat{f}_{2},\ldots\}$, the hybrid SP $\sum_{i=1}^{|\hat{\mathcal{F}}|}\alpha_{i}\hat{f}_{i}$, where $\sum_{i=1}^{|\hat{\mathcal{F}}|}\alpha_{i}=1$ and $\alpha_{i}\geq 0$, is $\epsilon$-relaxed locally consistent.

Clearly, Propositions 4.6 and 4.7 apply to the basic notion of local consistency in Assumption 3.3 as a special case when taking $\epsilon=0$. Moreover, in the latter case, Proposition 4.7 can be strengthened to allow $\alpha_{i}$ to be negative by using a similar proof.

4.2 UQ with SNEB Property

SNEB in a UQ entails that the quantified uncertainty at the queried points decreases to 0 as the surrounding evaluated points populate, while the uncertainty stays positive for any unexplored area. When a UQ is associated with a particular SP, it is typically the case that if the SP is locally consistent and the UQ is a “truthful” representation of the uncertainty, then part 2 of the SNEB property in Assumption 3.4 holds. On the other hand, verifying part 1 of that assumption requires further conditions, as it concerns the unexplored region, which the local consistency property does not address.

We first consider the GP posterior variance (or standard deviation). This UQ, at point $x_{0}$ , is given by:

\hat{\sigma}_{\text{GP}}^{2}(x_{0};\mathcal{D}_{n}):=\mathcal{K}(x_{0},x_{0})-\mathcal{K}(x_{0},X_{n}){\mathcal{K}(X_{n},X_{n})^{-1}}\mathcal{K}(X_{n},x_{0})

(7)

which is derived through modeling $f$ as a GP with prior covariance represented via $\mathcal{K}$ [54]. Note that $\hat{\sigma}_{\text{GP}}^{2}(x_{0};\mathcal{D}_{n})$ does not depend on the prior mean function $\mu_{0}$ or the labels in the data as seen from (7).

Furthermore, we call a covariance function stationary if it has the property $\mathcal{K}(x,y)=\mathcal{K}(x-y)$. That is, the covariance between the function values at two inputs $x$ and $y$ depends only on their difference. This property is commonly imposed for modeling black-box functions, and popular examples are the Gaussian covariance and the Matérn covariance:

\mathcal{K}_{Gaussian}(x,y)=\exp\{-\frac{\|x-y\|^{2}}{2h^{2}}\},

(8)

where the bandwidth $h$ is a positive hyperparameter, and

\mathcal{K}_{Matern}(x,y)=\frac{2^{1-\nu}}{\Gamma(\nu)}\left(\sqrt{2\nu}\frac{\|x-y\|}{\rho}\right)^{\nu}K_{\nu}\left(\sqrt{2\nu}\frac{\|x-y\|}{\rho}\right),

(9)

where $\Gamma$ is the gamma function, $K_{\nu}$ is the modified Bessel function of the second kind, and $\rho$ , $\nu$ are positive hyperparameters for the covariance function.
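
As a small worked example of how (7) behaves (our own illustration, not part of the formal results), take a single evaluated point $x_{1}$, so that $X_{1}=\{x_{1}\}$. Both kernels above satisfy $\mathcal{K}(x,x)=1$. The Matérn case is shown only for the particular choice $\nu=1/2$, where (9) reduces to the exponential kernel $\exp(-\|x-y\|/\rho)$.

```latex
% One evaluated point x_1: K(X_1,X_1) is the scalar K(x_1,x_1) = 1, so (7) reads
\hat{\sigma}_{\text{GP}}^{2}(x_{0};\mathcal{D}_{1})
  = \mathcal{K}(x_{0},x_{0}) - \frac{\mathcal{K}(x_{0},x_{1})^{2}}{\mathcal{K}(x_{1},x_{1})}
  = 1 - \mathcal{K}(x_{0},x_{1})^{2}.
% Gaussian covariance (8):
\hat{\sigma}_{\text{GP}}^{2}(x_{0};\mathcal{D}_{1})
  = 1 - \exp\left\{-\frac{\|x_{0}-x_{1}\|^{2}}{h^{2}}\right\}.
% Matern covariance (9) with \nu = 1/2:
\hat{\sigma}_{\text{GP}}^{2}(x_{0};\mathcal{D}_{1})
  = 1 - \exp\left\{-\frac{2\|x_{0}-x_{1}\|}{\rho}\right\}.
```

In both cases the variance vanishes at $x_{0}=x_{1}$ and is strictly positive elsewhere, which is the qualitative behavior that SNEB formalizes.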

Finally, the spectral density of a stationary GP describes its covariance function in the frequency domain. For a stationary GP, the covariance function $\mathcal{K}(\tau)$, where $\tau:=x-y$, can be transformed into the frequency domain using the Fourier transform, which gives us the spectral density. Mathematically, the spectral density $\mathcal{S}(\omega)$ is defined as the Fourier transform of the covariance function:

\mathcal{S}(\omega)=\int_{-\infty}^{\infty}\mathcal{K}(\tau)\exp(-i\omega\tau)\,\text{d}\tau.

With all the above, we have the following guarantee:

Proposition 4.8 (SNEB of GP posterior standard deviation).

Suppose the assumptions of Proposition 4.1 hold and, additionally, the GP is stationary with a spectral density $\mathcal{S}$ such that $\mathcal{S}^{-1}$ has at most polynomial growth. Then the GP posterior standard deviation $\hat{\sigma}_{\text{GP}}(x_{0};\mathcal{D}_{n})$ is SNEB.

Proposition 4.8 largely follows from the seminal GP result in [68]. Next, we also have that the minimum distance UQ is SNEB.

Proposition 4.9 (SNEB of minimum distance).

$\Delta$ is SNEB.

Next we present a UQ utilizing the randomized prior, which we have briefly discussed in the SP examples in Section 4.1. This method first randomly generates a continuous function $r$ from a generating distribution $\mathcal{R}$. We fit the SP $\hat{f}^{(r)}(x_{0};\mathcal{D}_{n})=r(x_{0})+\hat{f}(x_{0};\mathcal{D}^{(r)}_{n})$, where $\hat{f}$ is a base SP and $\mathcal{D}^{(r)}_{n}$ is a data set perturbed by $r$, i.e., $\mathcal{D}_{n}^{(r)}:=(x_{i},f(x_{i})-r(x_{i}))_{i=1}^{n}$. Upon repeating the sampling of $r$ many times, the UQ can be computed as $(\operatorname{Var}_{r\sim\mathcal{R}}[\hat{f}^{(r)}(x_{0};\mathcal{D}_{n})])^{\nicefrac{{1}}{{2}}}$.

Here is the rationale of this randomized prior UQ. Suppose a test point $x_{0}$ is surrounded by many evaluated points. Then, if $f$ is continuous and $\hat{f}$ is locally consistent, we have $\hat{f}(x_{0};\mathcal{D}^{(r)}_{n})$ roughly equal to $f(x_{0})-r(x_{0})$, so that $\hat{f}^{(r)}(x_{0};\mathcal{D}_{n})$ becomes roughly $f(x_{0})$ by canceling out the $r(x_{0})$. Thus, $\operatorname{Var}_{r\sim\mathcal{R}}[\hat{f}^{(r)}(x_{0};\mathcal{D}_{n})]$ becomes close to zero. This signals a low uncertainty at $x_{0}$. On the other hand, suppose $x_{0}$ has no surrounding evaluated points, or in other words the region surrounding $x_{0}$ is sparsely evaluated. Then, $\hat{f}(x_{0};\mathcal{D}^{(r)}_{n})$ would have weak reliance on $r$. For example, $\hat{f}$ could just output a constant for $x_{0}$ when there are no neighboring evaluated points. In this case, $\hat{f}^{(r)}(x_{0};\mathcal{D}_{n})$ equals $r(x_{0})$ plus that constant, which has a high variance. This signals a high uncertainty at $x_{0}$. This “perturb-then-compensate” principle of randomized prior thus naturally distinguishes between densely and sparsely evaluated regions.
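
The rationale above can be illustrated with a short Monte Carlo sketch. Everything specific in it is an assumption made for illustration: the base SP is a one-dimensional box-kernel local regression with $M=1$ and $h=0.2$, the generating distribution $\mathcal{R}$ draws $r(x)=a\sin(x)+b\cos(x)$ with $a,b$ independent and uniform on $[-1,1]$ (so that $\operatorname{Var}[r(x)]=1/3$ at every $x$), and the expectation and variance over $r$ are approximated by 2000 draws.

```python
import math
import random
import statistics
from typing import Callable, List, Tuple

def kernel_regression(x0: float, X: List[float], y: List[float],
                      h: float = 0.2, default: float = 0.0) -> float:
    """Base SP: box-kernel local regression (5) with M = 1, in one dimension."""
    near = [yi for xi, yi in zip(X, y) if abs(x0 - xi) <= h]
    return sum(near) / len(near) if near else default

def sample_prior(rng: random.Random) -> Callable[[float], float]:
    """Draw r(x) = a sin(x) + b cos(x) with a, b ~ Uniform[-1, 1] (assumed choice of R)."""
    a, b = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    return lambda x: a * math.sin(x) + b * math.cos(x)

def randomized_prior(x0: float, X: List[float], y: List[float],
                     n_draws: int = 2000, seed: int = 0) -> Tuple[float, float]:
    """Monte Carlo mean and standard deviation of r(x0) + f_hat(x0; D^(r))."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_draws):
        r = sample_prior(rng)
        y_perturbed = [yi - r(xi) for xi, yi in zip(X, y)]            # perturb the labels
        preds.append(r(x0) + kernel_regression(x0, X, y_perturbed))  # compensate by r(x0)
    return statistics.fmean(preds), statistics.pstdev(preds)

# Toy data: a cluster of evaluated points near 0.1 and an unexplored gap around 1.5.
X = [0.0, 0.1, 0.2, 3.0]
y = [math.sin(x) for x in X]
mean_near, std_near = randomized_prior(0.1, X, y)  # densely evaluated region
mean_far, std_far = randomized_prior(1.5, X, y)    # no evaluated point within h
```

In this toy setting `std_near` is close to zero because $r$ nearly cancels inside the kernel window, while `std_far` is close to the prior standard deviation $\sqrt{1/3}$, matching the two cases described above.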

As far as we know, the concept of randomized prior appears initially in [48] to quantify uncertainty in deep reinforcement learning. Compared to the bootstrap [19], this approach appears procedurally similar but is conceptually different. In particular, randomized prior uses the “perturb-then-compensate” principle to signal high uncertainty for unexplored regions, while the bootstrap is designed to quantify the statistical errors arising from data randomness. The latter arguably works only when data are sufficiently abundant, since its underlying resampling can accurately resemble the sampling distribution only in such situations. Moreover, the uncertainty it captures comes from data randomness, while in the PseudoBO setting there is no data randomness and the uncertainty comes purely from the lack of knowledge of the objective function. Finally, we mention that [48] proposes randomized prior with neural networks; however, we generalize it here to broader regressors and can pair it with any other locally consistent SP.

Finally, if the SP is $\epsilon$ -relaxed locally consistent instead of exactly locally consistent, we can use

((\operatorname{Var}_{r\sim\mathcal{R}}[\hat{f}^{(r)}(x_{0};\mathcal{D}_{n})])^{\nicefrac{{1}}{{2}}}-\epsilon)_{+}

(10)

for any point $x_{0}$, where $(y)_{+}$ is the positive part function, i.e., $(y)_{+}=y$ if $y>0$ and $(y)_{+}=0$ otherwise. This definition allows for the $\epsilon$ prediction error, so that when the prediction uncertainty signified by $(\operatorname{Var}_{r\sim\mathcal{R}}[\hat{f}^{(r)}(x_{0};\mathcal{D}_{n})])^{\nicefrac{{1}}{{2}}}$ is less than $\epsilon$, we take the UQ as 0 in the PseudoBO procedure. Correspondingly, the UQ is positive only if the test point $x_{0}$ is sufficiently far away from all evaluated points, giving rise to a $\delta$-relaxed notion of SNEB.

Proposition 4.10 ( $\delta$ -relaxed SNEB of randomized prior standard deviation).

Suppose the assumptions of Proposition 4.6 hold and that, for some fixed $\delta>0$, $(\operatorname{Var}_{r\sim\mathcal{R}}[\hat{f}^{(r)}(x_{0};\mathcal{D}_{n})])^{\nicefrac{{1}}{{2}}}>\epsilon$ whenever $\inf_{n}\Delta(x_{0};\mathcal{D}_{n})>\delta$, for any $x_{0}$ and $\mathcal{D}_{n}$. Then the UQ (10) is $\delta$-relaxed SNEB.

We can apply the randomized prior UQ to the local kernel regression SP defined in (5), giving the following:

Corollary 4.11 ( $\delta$ -relaxed SNEB of randomized prior standard deviation for local regression).

Under the same assumptions as Proposition 4.5, suppose $r$ is an $\tilde{L}$-Lipschitz and uniformly bounded function randomly generated from the generating distribution $\mathcal{R}$, with $(\operatorname{Var}_{r\sim\mathcal{R}}[r(x)])^{\nicefrac{{1}}{{2}}}>(L+\tilde{L})Mh$ for any $x\in\mathcal{X}$. Also assume $f$ is bounded. Then the UQ $((\operatorname{Var}_{r\sim\mathcal{R}}[\hat{f}^{(r)}(x_{0};\mathcal{D}_{n})])^{\nicefrac{{1}}{{2}}}-(L+\tilde{L})Mh)_{+}$ for any point $x_{0}$, with base SP (5), is $Mh$-relaxed SNEB.

Finally, like our discussion of SP, we also have the preservation of ( $\delta$ -relaxed) SNEB for a combination of UQs which we call hybrid UQ:

Proposition 4.12 (Preservation of $\delta$ -relaxed SNEB of hybrid UQ).

Given a finite set of $\delta$-relaxed SNEB UQs $\hat{\Sigma}:=\{\hat{\sigma}_{1},\hat{\sigma}_{2},\ldots\}$, the hybrid UQ $\sum_{i=1}^{|\hat{\Sigma}|}\alpha_{i}\hat{\sigma}_{i}$, where $\sum_{i=1}^{|\hat{\Sigma}|}\alpha_{i}=1$ and $\alpha_{i}\geq 0$, is $\delta$-relaxed SNEB.

Similar to our discussion of SP in Section 4.1, Proposition 4.12 reduces to the case of exact SNEB when $\delta=0$. On the other hand, unlike there, we can readily see from the proof of Proposition 4.12 that we can relax the convex combination of UQs to any linear combination of UQs with weights $\alpha_{i}\geq 0$ for all $i$ with at least one of the weights being strictly positive. Nonetheless, from a practical perspective, scaling the UQs by a multiplicative factor does not change the PseudoBO performance as long as we suitably rescale the AF.

4.3 AF with Improvement Property

The improvement property of AF, which signifies a zero EW for points that certainly lead to no improvement and a positive EW for uncertain points, is satisfied by classical criteria such as probability of improvement (PI) and expected improvement (EI). These criteria can be written as

g_{n}^{\text{PI}}(p_{n},q_{n})=\begin{cases}\Phi\left(\frac{p_{n}-\tau}{q_{n}}\right),&\text{if}\ q_{n}>0,\\ \mathbbm{1}\{p_{n}-\tau>0\},&\text{if}\ q_{n}=0.\\ \end{cases}

(11)

for PI, and

g_{n}^{\text{EI}}(p_{n},q_{n})=\begin{cases}q_{n}\phi\left(\frac{p_{n}-\tau}{q_{n}}\right)+(p_{n}-\tau)\Phi\left(\frac{p_{n}-\tau}{q_{n}}\right),&\text{if}\ q_{n}>0,\\ \max\{p_{n}-\tau,0\},&\text{if}\ q_{n}=0.\\ \end{cases}

(12)

for EI, where $\phi(\cdot)$ and $\Phi(\cdot)$ denote the standard normal density and distribution functions, and $\tau\geq 0$ is a small hyperparameter to ensure smoothness at the “boundary” between $q_{n}=0$ and $q_{n}>0$ , especially in the PI case where we will set $\tau>0$ . Typically, the potential improvement $p_{n}$ is defined as $\hat{f}(x;\mathcal{D}_{n})-\max\Pi_{f}(\mathcal{D}_{n})$ , which is the value of the SP $\hat{f}$ above the best evaluated objective value $\max\Pi_{f}(\mathcal{D}_{n})$ , and the uncertainty estimate $q_{n}$ is taken as our UQ $\hat{\sigma}(x;\mathcal{D}_{n})$ . With these, (11) and (12) give

\text{PI}(x;\mathcal{D}_{n})=\begin{cases}\Phi\left(\frac{\hat{f}(x;\mathcal{D}_{n})-\max\Pi_{f}(\mathcal{D}_{n})-\tau}{\hat{\sigma}(x;\mathcal{D}_{n})}\right),&\text{if}\ \hat{\sigma}(x;\mathcal{D}_{n})>0,\\ \mathbbm{1}\{\hat{f}(x;\mathcal{D}_{n})-\max\Pi_{f}(\mathcal{D}_{n})-\tau>0\},&\text{if}\ \hat{\sigma}(x;\mathcal{D}_{n})=0.\\ \end{cases}

(13)

and

\text{EI}(x;\mathcal{D}_{n})=\begin{cases}\hat{\sigma}(x;\mathcal{D}_{n})\phi\left(\frac{\hat{f}(x;\mathcal{D}_{n})-\max\Pi_{f}(\mathcal{D}_{n})-\tau}{\hat{\sigma}(x;\mathcal{D}_{n})}\right)+(\hat{f}(x;\mathcal{D}_{n})-\max\Pi_{f}(\mathcal{D}_{n})-\tau)\Phi\left(\frac{\hat{f}(x;\mathcal{D}_{n})-\max\Pi_{f}(\mathcal{D}_{n})-\tau}{\hat{\sigma}(x;\mathcal{D}_{n})}\right),&\text{if}\ \hat{\sigma}(x;\mathcal{D}_{n})>0,\\ \max\{\hat{f}(x;\mathcal{D}_{n})-\max\Pi_{f}(\mathcal{D}_{n})-\tau,0\},&\text{if}\ \hat{\sigma}(x;\mathcal{D}_{n})=0.\\ \end{cases}

(14)

The above definitions are precisely $P(Z>\tau)$ and $E[(Z-\tau)_{+}]$ respectively, where $Z$ denotes a normal variable with mean $p_{n}=\hat{f}(x;\mathcal{D}_{n})-\max\Pi_{f}(\mathcal{D}_{n})$ and standard deviation $q_{n}=\hat{\sigma}(x;\mathcal{D}_{n})$ . To facilitate discussion, let $\tau=0$ for now. In the BO literature, $\hat{f}(x;\mathcal{D}_{n})$ corresponds to the posterior mean and $\hat{\sigma}(x;\mathcal{D}_{n})$ the posterior standard deviation, so that $P(Z>0)$ is the posterior probability under the GP model that a point $x$ has a higher objective value than $\max\Pi_{f}(\mathcal{D}_{n})$ , and $E[Z_{+}]$ is the expectation of this excess objective value. The following verifies that these classical PI and EI criteria satisfy our improvement property:

Proposition 4.13 (Improvement Property of PI).

With $\tau>0$ , PI defined as (11) has the improvement property.

Proposition 4.14 (Improvement Property of EI).

With $\tau\geq 0$ , EI defined as (12) has the improvement property.
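
For concreteness, (11) and (12) can be sketched as follows. The function names and the default $\tau$ values are our own choices; the standard normal distribution function is computed from `math.erf`. The final lines only spot-check the behavior described informally above (zero value when there is certainly no improvement, positive value when there is uncertainty) and are not a verification of the formal improvement property.

```python
import math

def norm_pdf(z: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z: float) -> float:
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def g_pi(p: float, q: float, tau: float = 0.01) -> float:
    """Probability of improvement, (11), with potential improvement p and uncertainty q."""
    if q > 0:
        return norm_cdf((p - tau) / q)
    return 1.0 if p - tau > 0 else 0.0

def g_ei(p: float, q: float, tau: float = 0.0) -> float:
    """Expected improvement, (12), with potential improvement p and uncertainty q."""
    if q > 0:
        z = (p - tau) / q
        return q * norm_pdf(z) + (p - tau) * norm_cdf(z)
    return max(p - tau, 0.0)

# No uncertainty and no predicted improvement: both criteria vanish.
pi_certain, ei_certain = g_pi(0.0, 0.0), g_ei(0.0, 0.0)
# Positive uncertainty: both criteria are positive even if the prediction is worse.
pi_uncertain, ei_uncertain = g_pi(-1.0, 0.5), g_ei(-1.0, 0.5)
```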

Another important acquisition approach is the upper confidence bound (UCB) widely used in online learning, defined as $\text{UCB}(x;\mathcal{D}_{n})=\hat{f}(x;\mathcal{D}_{n})+\beta_{n}\hat{\sigma}(x;\mathcal{D}_{n})$ where $\beta_{n}$ is a step-dependent positive hyperparameter to trade off exploitation and exploration. To connect to our AF notion, consider

g_{n}^{\text{UCB}}(p_{n},q_{n})=\frac{p_{n}-\tau}{\beta_{n}}+q_{n}

(15)

where $\tau\geq 0$ is again a hyperparameter, as in PI and EI described earlier (in fact, we can simply set $\tau=0$ in this case). Plugging in $p_{n}=\hat{f}(x;\mathcal{D}_{n})-\max{\Pi_{f}(\mathcal{D}_{n})}$ and $q_{n}=\hat{\sigma}(x;\mathcal{D}_{n})$, we can see that (15) is a rescaling of the more familiar form $\text{UCB}(x;\mathcal{D}_{n})$ and thus the maximizer is retained. More precisely, we have $\operatorname*{arg\,max}_{x\in\mathcal{X}}\hat{f}(x;\mathcal{D}_{n})+\beta_{n}\hat{\sigma}(x;\mathcal{D}_{n})=\operatorname*{arg\,max}_{x\in\mathcal{X}}(\hat{f}(x;\mathcal{D}_{n})-\max{\Pi_{f}(\mathcal{D}_{n})}-\tau)+\beta_{n}\hat{\sigma}(x;\mathcal{D}_{n})=\operatorname*{arg\,max}_{x\in\mathcal{X}}\frac{p_{n}-\tau}{\beta_{n}}+q_{n}$.

Proposition 4.15 (Improvement Property of UCB).

Suppose $\tau$ is fixed, $\beta_{n}>0$ is a sequence that goes to $\infty$ as $n\to\infty$, and $p_{n}$ is bounded. Then UCB defined as (15) has the improvement property.

Note that we have assumed the sequence $p_{n}$ is bounded in Proposition 4.15. When the objective function $f$ and SP $\hat{f}$ are bounded, then $p_{n}$ when plugged in as $\hat{f}(x;\mathcal{D}_{n})-\max{\Pi_{f}(\mathcal{D}_{n})}$ would be bounded as well, so that this additional assumption is readily achieved.
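
The claim that the rescaled form (15) retains the maximizer of the familiar UCB can be checked on a small example. The SP and UQ values below are made-up numbers on four hypothetical candidate points, and `best_so_far` stands in for $\max\Pi_{f}(\mathcal{D}_{n})$; none of these come from the experiments.

```python
from typing import List

def ucb(f_hat: float, sigma: float, beta: float) -> float:
    """Familiar UCB: f_hat + beta * sigma."""
    return f_hat + beta * sigma

def g_ucb(p: float, q: float, beta: float, tau: float = 0.0) -> float:
    """Rescaled UCB in the AF form (15): (p - tau) / beta + q."""
    return (p - tau) / beta + q

# Hypothetical SP and UQ values at four candidate points.
f_hat: List[float] = [0.2, 0.9, 0.5, 0.7]
sigma: List[float] = [0.8, 0.1, 0.6, 0.3]
best_so_far = 0.6  # stands in for max Pi_f(D_n)
beta = 2.0

classic = [ucb(m, s, beta) for m, s in zip(f_hat, sigma)]
rescaled = [g_ucb(m - best_so_far, s, beta) for m, s in zip(f_hat, sigma)]
argmax_classic = max(range(len(classic)), key=classic.__getitem__)
argmax_rescaled = max(range(len(rescaled)), key=rescaled.__getitem__)
```

With $\tau=0$, each entry of `classic` equals `beta` times the corresponding entry of `rescaled` plus `best_so_far`, an increasing affine map, so the two lists share the same maximizer.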

Finally, like SP and UQ, the improvement property is also preserved if we consider a combination of AFs, which we call hybrid AF:

Proposition 4.16 (Improvement property of hybrid AF).

Given a finite set of AFs $\mathcal{G}:=\{g_{1},g_{2},...\}$ with the improvement property, the hybrid AF $\sum_{i=1}^{|\mathcal{G}|}\alpha_{i}g_{i}$ , where $\sum_{i=1}^{|\mathcal{G}|}\alpha_{i}=1$ and $\alpha_{i}\geq 0$ , has the improvement property.

Similar to hybrid UQ, we can strengthen Proposition 4.16 to relax the convex combination to any linear combination of AFs with weights $\alpha_{i}\geq 0$ for any $i$ where at least one of these weights is strictly positive. Nonetheless, since a rescaling does not affect the maximizer of the AF, considering only convex combination is without loss of generality.

5 From Theory to Implementation

Our next goal is to leverage the PseudoBO recipe to construct empirically superior algorithms. We consider the following in choosing our ingredients for implementation:

Measuring UQ Quality. To look for a good UQ, we propose a criterion called calibrated coverage rate (CCR), which considers the quality of both SP and UQ. Given a training set $\mathcal{D}_{\text{train}}:=(X_{\text{train}},Y_{\text{train}})$, a validation set $\mathcal{D}_{\text{val}}:=(X_{\text{val}},Y_{\text{val}})$, and a test set $\mathcal{D}_{\text{test}}:=(X_{\text{test}},Y_{\text{test}})$, suppose we have a pretrained SP $\hat{f}(\cdot;\mathcal{D}_{\text{train}})$ and UQ $\hat{\sigma}(\cdot;\mathcal{D}_{\text{train}})$. We compute the test-set coverage rate of a “prediction interval” $[\hat{f}(x;\mathcal{D}_{\text{train}})-\lambda_{\text{val}}\hat{\sigma}(x;\mathcal{D}_{\text{train}}),\hat{f}(x;\mathcal{D}_{\text{train}})+\lambda_{\text{val}}\hat{\sigma}(x;\mathcal{D}_{\text{train}})]$ by

\mathds{P}_{(x,y)\sim\mathcal{D}_{\text{test}}}(y\in[\hat{f}(x;\mathcal{D}_{\text{train}})-\lambda_{\text{val}}\hat{\sigma}(x;\mathcal{D}_{\text{train}}),\hat{f}(x;\mathcal{D}_{\text{train}})+\lambda_{\text{val}}\hat{\sigma}(x;\mathcal{D}_{\text{train}})]),

where $\lambda_{\text{val}}$ is the smallest $\lambda\geq 0$ such that $\mathds{P}_{(x,y)\sim\mathcal{D}_{\text{val}}}(y\in[\hat{f}(x;\mathcal{D}_{\text{train}})-\lambda\hat{\sigma}(x;\mathcal{D}_{\text{train}}),\hat{f}(x;\mathcal{D}_{\text{train}})+\lambda\hat{\sigma}(x;\mathcal{D}_{\text{train}})])=1$, i.e., the prediction interval is constructed to be just wide enough to cover the validation set perfectly. Intuitively, the higher the coverage on the test set, the better the SP+UQ pair.
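The CCR computation can be sketched as follows. This is a minimal illustration, not our experimental code: the function names are hypothetical, the SP and UQ are assumed to have already been evaluated at the validation and test inputs, and the calibration uses a closed-form shortcut (the largest ratio of absolute error to uncertainty) that plays the same role as a bisection search whenever $\hat{\sigma}>0$ at every point with nonzero error.

```python
def calibrate_lambda(preds, sigmas, ys):
    """Smallest lambda >= 0 such that every label y lies in
    [pred - lambda * sigma, pred + lambda * sigma] on the validation set."""
    lam = 0.0
    for m, s, y in zip(preds, sigmas, ys):
        err = abs(y - m)
        if err == 0.0:
            continue  # already covered for any lambda >= 0
        if s <= 0.0:
            raise ValueError("zero uncertainty at a point with nonzero error")
        lam = max(lam, err / s)
    return lam

def coverage_rate(preds, sigmas, ys, lam):
    """Fraction of labels falling inside the calibrated interval."""
    covered = sum(1 for m, s, y in zip(preds, sigmas, ys) if abs(y - m) <= lam * s)
    return covered / len(ys)

# Toy numbers (illustrative only): SP predictions, UQ values, and labels.
lam_val = calibrate_lambda([0.0, 0.0], [1.0, 2.0], [1.5, 1.0])
ccr = coverage_rate([0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 2.0, 0.5],
                    [1.0, 2.0, -2.5, 0.5], lam_val)
print(lam_val, ccr)
```

In floating-point arithmetic the product of the calibrated $\lambda$ and $\hat{\sigma}$ can fall marginally below the error that defined it, so a practical implementation may want to add a small tolerance.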

Model Configuration. We configure the following models: PseudoBO with SP and UQ as randomized prior mean and standard deviation (PseudoBO-RP) with the base SP as local kernel regression, and PseudoBO with SP and UQ as local kernel regression and a hybrid of minimum distance and randomized prior standard deviation (PseudoBO-KR-Hyb), defined as:

\sigma_{\text{Hyb}}(x;\mathcal{D}_{n})=\alpha_{n}\Delta(x,\mathcal{D}_{n})+(1-\alpha_{n})\sigma_{\text{RP}}(x,\tilde{\mathcal{D}}_{n}),\quad\text{with}\quad\alpha_{n}:=e^{-\Delta(x;X_{n})\cdot n},

(16)

where $\tilde{\mathcal{D}}_{n}$ denotes the bootstrapped data. The intuition behind this UQ is that randomized prior and bootstrapping together can best quantify the local uncertainty. However, randomized prior with bootstrapping is not guaranteed to have the SNEB property. The latter can be achieved via a hybrid version that combines it with the minimum-distance UQ, where $\alpha_{n}\rightarrow 1$ as $x$ gets closer to $X_{n}$. Lastly, we also try the combination of trust region and PseudoBO-KR-Hyb, abbreviated as PseudoBO-KR-Hyb-TR.
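A minimal sketch of the hybrid UQ in (16) is given below. It assumes that $\Delta$ is the Euclidean distance to the nearest evaluated point and treats the randomized-prior standard deviation as a user-supplied callable `sigma_rp`; both the function names and the constant stand-in for `sigma_rp` are hypothetical.

```python
import math

def min_distance(x, X):
    """Delta(x, X_n): distance from x to the closest evaluated point (Euclidean assumed)."""
    return min(math.dist(x, xi) for xi in X)

def sigma_hybrid(x, X, sigma_rp):
    """Hybrid UQ: alpha * Delta + (1 - alpha) * sigma_RP, with alpha = exp(-Delta * n)."""
    n = len(X)
    delta = min_distance(x, X)
    alpha = math.exp(-delta * n)
    return alpha * delta + (1.0 - alpha) * sigma_rp(x)

# Toy usage with a constant stand-in for the randomized-prior std (illustrative only).
X_n = [(0.0, 0.0), (1.0, 1.0)]
rp_std = lambda x: 0.7
at_data = sigma_hybrid((0.0, 0.0), X_n, rp_std)   # evaluated point
far_away = sigma_hybrid((3.0, 4.0), X_n, rp_std)  # unexplored point
print(at_data, far_away)
```

At an evaluated point, $\Delta=0$ and $\alpha_{n}=1$, so the hybrid UQ vanishes regardless of the randomized-prior term; far from the data, $\alpha_{n}$ is close to $0$ and the randomized-prior term dominates.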

Optimizing EW. From our empirical investigation, we recommend generating candidates from a scrambled Sobol sequence and picking the point with the largest EW among them. This appears to enhance performance compared with random search or BFGS.

6 Empirical Evaluations

In this section, we perform extensive empirical experiments on a wide range of tasks: a toy example for testing the quality of UQ, $4$ synthetic black-box function optimization problems, $4$ hyperparameter tuning tasks, and $2$ robotic tasks. All these tasks are challenging and contain many local minima. The baseline models we consider in all the tasks comprise the standard BO, Random Search (RS) [8], SMAC [29], TPE [7], BORE [65], and LFBO [61]. In particular, BO is implemented using BoTorch [4], and SMAC is implemented using AutoML [38]. In the large-scale robotic experiments in Sections 6.4 and 6.5, we also include TuRBO for comparison.

For AF, the benchmarks BO and SMAC, as well as our PseudoBO variants, use EI; TPE and BORE are designed to model PI; LFBO models EI; and TuRBO is designed to work with Thompson sampling. Additional details regarding our experiments can be found in Appendix 9.

6.1 Calibrating SP+UQ

We start by comparing the quality of different SP+UQ pairs via their CCRs: GP posterior mean + standard deviation, neural network (NN) + minimum distance (MD), randomized prior (RP) mean + standard deviation with base SP as local kernel regression (KR), and KR + Hybrid. We test on three one-dimensional (1D) benchmark functions: $f_{1}(x)=(\sin(\pi w))^{2}+(w-1)^{2}\cdot(1+\sin(2\pi w)^{2})$, where $w:=1+(x-1)/4$, $x\in[-10,10]$; $f_{2}(x)=-20e^{-0.2|x|}-e^{\cos(2\pi x)}+20-e$, $x\in[-10,5]$; $f_{3}(x)=\sin(10\cdot\pi x)/(2x)+(x-1)^{4}$, $x\in[0.5,2.5]$. We generate the training set $\mathcal{D}_{\text{train}}$, the validation set $\mathcal{D}_{\text{val}}$ and the test set $\mathcal{D}_{\text{test}}$ by uniformly sampling $20$, $10$ and $150$ points from the decision space at random, with their labels evaluated by the objective function. The bisection method is used to find $\lambda_{\text{val}}$.
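For concreteness, the three benchmark functions and the data split can be sketched as below. The code transcribes the formulas exactly as written above, reading $\sin(2\pi w)^{2}$ as $(\sin(2\pi w))^{2}$; the helper `sample_split` and the seed are illustrative choices rather than our experimental setup.

```python
import math
import random

def f1(x):
    w = 1.0 + (x - 1.0) / 4.0
    return math.sin(math.pi * w) ** 2 + (w - 1.0) ** 2 * (1.0 + math.sin(2.0 * math.pi * w) ** 2)

def f2(x):
    return -20.0 * math.exp(-0.2 * abs(x)) - math.exp(math.cos(2.0 * math.pi * x)) + 20.0 - math.e

def f3(x):
    return math.sin(10.0 * math.pi * x) / (2.0 * x) + (x - 1.0) ** 4

# Objective together with its decision space [lo, hi].
BENCHMARKS = {"f1": (f1, -10.0, 10.0), "f2": (f2, -10.0, 5.0), "f3": (f3, 0.5, 2.5)}

def sample_split(f, lo, hi, sizes=(20, 10, 150), seed=0):
    """Draw train/validation/test inputs uniformly at random and label them with f."""
    rng = random.Random(seed)
    splits = []
    for size in sizes:
        xs = [rng.uniform(lo, hi) for _ in range(size)]
        splits.append((xs, [f(x) for x in xs]))
    return splits  # [train, val, test]

train, val, test = sample_split(*BENCHMARKS["f3"])
print(len(train[0]), len(val[0]), len(test[0]))
```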

	$f_{1}$		$f_{2}$		$f_{3}$
SP+UQ pair	CCR	width	CCR	width	CCR	width
GP	$0.91(\pm 0.12)$	$5.09(\pm 2.38)$	$\mathbf{0.93(\pm 0.04)}$	$6.29(\pm 1.15)$	$0.94(\pm 0.04)$	$1.57(\pm 0.19)$
NN + MD	$0.92(\pm 0.06)$	$5.37(\pm 1.90)$	$0.91(\pm 0.07)$	$4.11(\pm 0.93)$	$\mathbf{0.96(\pm 0.03)}$	$1.59(\pm 0.04)$
RP	$0.88(\pm 0.10)$	$62.29(\pm 43.96)$	$0.89(\pm 0.07)$	$76.48(\pm 27.02)$	$0.93(\pm 0.04)$	$20.46(\pm 8.47)$
KR + Hybrid	$\mathbf{0.93(\pm 0.07)}$	$5.34(\pm 1.63)$	$0.92(\pm 0.08)$	$4.45(\pm 1.05)$	$\mathbf{0.96(\pm 0.04)}$	$0.93(\pm 0.36)$

Table 1: Coverage rate and width of the calibrated “prediction interval” on the test set. Each experiment is repeated for 10 runs. The reported numbers are the empirical averages over these 10 repetitions, with the values in brackets representing the empirical standard deviations.

From Table 1, we see that KR + Hybrid has the highest CCR in two out of the three test cases (tied with NN + MD on $f_{3}$), while GP and NN + MD each have the highest CCR in one test case. RP achieves lower CCRs and large calibrated widths, possibly because, while RP is designed to flag large uncertainty for unexplored areas, without bootstrapping on the training set its uncertainty quantification for explored areas can be less precise. To be concrete, Fig. 2 shows some exemplified results on the three benchmark problems.

6.2 Synthetic Black-Box Function Optimization

We test our considered methods on the minimization of $4$ well-known benchmark functions: a 2D Goldstein-Price function, a 2D Drop-wave function, a 6D Hartmann function, and a 10D Ackley function. For all the methods, we run $100$ iterations with $5$ initial samples for the two 2D functions, and $500$ iterations with $10$ initial samples for the two higher-dimensional functions. From Fig. 3, we see that PseudoBO-KR-Hyb achieves the lowest objective in three out of four tasks within the query budget. TuRBO demonstrates competitive performance by reaching the lowest objective in two out of four tasks. Moreover, we record the cumulative regret in Fig. 4. PseudoBO-KR-Hyb appears to have cumulative regret that is sublinear in the iteration number in all cases and is the best in three out of four tasks.
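As a reminder of how such curves are typically computed, the sketch below assumes the standard definition of cumulative regret for a minimization problem, $R_{n}=\sum_{t\leq n}(f(x_{t})-f^{*})$, where $f^{*}$ is the known global minimum of the benchmark. The function names and the toy values are illustrative.

```python
from itertools import accumulate

def cumulative_regret(observed, f_star):
    """R_n = sum over t <= n of (f(x_t) - f_star), for a minimization problem."""
    return list(accumulate(y - f_star for y in observed))

def average_regret(observed, f_star):
    """R_n / n; this ratio tends to 0 when the cumulative regret is sublinear."""
    return [r / (i + 1) for i, r in enumerate(cumulative_regret(observed, f_star))]

# Toy sequence of observed objective values with known minimum 1.0 (illustrative only).
regret = cumulative_regret([3.0, 1.0, 2.0], 1.0)
avg = average_regret([3.0, 1.0, 2.0], 1.0)
print(regret, avg)
```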

6.3 Hyperparameter Tuning

We test the methods on a more realistic task of neural network tuning [34]. Our main goal here is to find the optimal set of hyperparameters for a two-layer fully connected neural network, to achieve the best training outcomes on four UCI datasets [18]. There are $9$ hyperparameters, including the initial learning rate, learning rate schedule, batch size, and the dropout rate, number of units, and activation function type for each of the two layers. This results in an expansive search space comprising $62,208$ possible configurations in total. Fig. 5 reveals that among all the methods, PseudoBO-KR-Hyb is the fastest to uncover the optimal configuration in three out of four tasks. PseudoBO-RP does not perform well here, possibly because the randomized prior does not produce a good fit for the tree-structured function in this task. Relatively competitive models in this task are LFBO, BORE, and SMAC: LFBO achieves the best performance in the Naval task and is quite close to PseudoBO-KR-Hyb in the Parkinsons task, while BORE and SMAC are able to locate the optimal configurations in three out of four tasks. We also record the cumulative regret for all methods in Fig. 6. The plot shows that PseudoBO-KR-Hyb has cumulative regret that is sublinear in the iteration number and has the lowest cumulative regret in three out of four tasks. In the second task, BO has the lowest cumulative regret. Although BO finds the best configuration in only one dataset, its cumulative regrets in these tasks appear competitive.

6.4 Robot Arm Pushing

This problem aims to control two robot arms with $14$ parameters to push two objects to target positions. The reward is evaluated by the ending positions of the two objects pushed by the two robot arms. See [70] for more details. In this task, we follow the procedure and setup in [20], conducting a batch of $50$ queries in each iteration and performing optimization with a budget of $10,000$ queries. All methods are initialized with $100$ queries. As demonstrated in Fig. 7 (1), all PseudoBO variants, particularly PseudoBO-KR-Hyb, converge significantly faster than other methods iteration-wise and eventually achieve a superior reward, except for TuRBO. When measured in runtime, Fig. 7 (2) illustrates that PseudoBO variants take merely 20s to achieve a reward of around 9, while most of the other methods remain at around 7 and TPE is at around 8.5. It is also noteworthy that BO stops at the 9,250-th query because the GPU runs out of memory. Another point to note is that Hyperopt (the package implementing TPE) does not support batch evaluation. Consequently, TPE queries and proposes the next point sequentially, allowing it to utilize more information than the other methods. This means its performance might decline if it were restricted to batch evaluation.

6.5 Rover Trajectory Planning

An additional robotic task we test is rover trajectory optimization, which determines the locations of $30$ points in a 2D plane, where the final reward is estimated by the ending position of the rover and the cost incurred by collision. See [70] for more details. We follow the procedure and setup in [20] to perform a batch of $100$ queries in each iteration and execute optimization within a $20,000$-query budget. All methods are started with $200$ queries. Due to the memory issue, BO is difficult to run and is not tested here. TPE is also excluded from the test, as sequentially querying and proposing the next evaluations takes an excessively long time to finish $20,000$ queries. Fig. 7 (3) shows that in terms of iterations, TuRBO converges slightly faster at the outset, but PseudoBO-KR-Hyb-TR later catches up with it. Moreover, all PseudoBO variants converge more rapidly than other methods and ultimately secure considerably better final rewards. In terms of runtime, Fig. 7 (4) shows that TuRBO converges as fast as PseudoBO-KR-Hyb-TR initially, but is surpassed by PseudoBO-KR-Hyb-TR at 125s, and later by PseudoBO-RP and PseudoBO-KR-Hyb at 400s. Eventually, PseudoBO-RP and PseudoBO-KR-Hyb yield rewards of around 3.8 and PseudoBO-KR-Hyb-TR around 4.2, while TuRBO only reaches 3.5.

6.6 Runtime Record

We close our discussion by comparing the runtimes of the considered methods. Table 2 enumerates the runtimes of all tested methods to complete the target number of queries in various tasks. Some entries are vacant because the corresponding methods were not included in our experiments due to memory or excessive runtime issues. For smaller-scale experiments, we exclude PseudoBO-KR-Hyb-TR as scalability is not a concern; in the larger-scale tasks, BO is eliminated due to memory exhaustion, while SMAC and TPE are not included due to their packages’ inability to support batch selection of candidates.

In small-sized tasks, our record reveals that RS and TPE demonstrate shorter runtimes than others. Immediately following these are BORE, LFBO, PseudoBO-RP, and PseudoBO-KR-Hyb, which all exhibit comparable runtimes. Despite not being as fast as RS and TPE, their performances are usually more promising in comparison (especially in the neural network tuning task). While TuRBO is generally faster than BO, its runtime can be inconsistent. In the task of 10D synthetic function optimization, it considerably outpaces most of the other methods, but this is not the case in the 12D neural network tuning task. This variability might arise from the restarting mechanism in the trust region method. Finally, both BO and SMAC consistently take the longest to complete the designated number of queries.

As the query budget grows to tens of thousands in the robotic tasks, RS and TPE still stand in the fastest tier in terms of runtime, immediately followed by the PseudoBO family. Notably, the scalability of the latter becomes even more apparent in these tasks, in the sense that the time gap with TPE narrows while the PseudoBO family significantly outperforms it in terms of attained objective value. Compared with other methods, in the 14D robotic task, the fastest PseudoBO model (PseudoBO-KR-Hyb-TR) is about 5X faster than BORE, 5X than LFBO, 4X than TuRBO and 60X than BO. Likewise, in the 60D robotic task, PseudoBO-KR-Hyb-TR is about 2.5X faster than BORE, 4X than LFBO and 5X than TuRBO. Fig. 8 shows a Pareto plot to illustrate the balance between the best objective value eventually attained and runtime.
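The approximate speedup factors quoted above can be recomputed from the runtimes reported in Table 2. The short sketch below does this arithmetic; the dictionary simply transcribes the relevant table entries (in seconds), and the helper name `speedups` is illustrative.

```python
# Runtimes in seconds, transcribed from Table 2 for the two robotic tasks.
RUNTIMES = {
    "Robot Push (14D)": {
        "BO": 6202.55, "BORE": 514.37, "LFBO": 575.38,
        "TuRBO": 496.09, "PseudoBO-KR-Hyb-TR": 107.22,
    },
    "Rover (60D)": {
        "BORE": 1361.19, "LFBO": 2130.78,
        "TuRBO": 2572.82, "PseudoBO-KR-Hyb-TR": 510.60,
    },
}

def speedups(task, reference="PseudoBO-KR-Hyb-TR"):
    """Runtime of each method divided by the runtime of the reference method."""
    ref = RUNTIMES[task][reference]
    return {m: round(t / ref, 1) for m, t in RUNTIMES[task].items() if m != reference}

push = speedups("Robot Push (14D)")
rover = speedups("Rover (60D)")
print(push)
print(rover)
```

The ratios come out at roughly 4.8 (BORE), 5.4 (LFBO), 4.6 (TuRBO) and 57.8 (BO) for the 14D task, and roughly 2.7 (BORE), 4.2 (LFBO) and 5.0 (TuRBO) for the 60D task, in line with the rounded factors stated in the text.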

Tasks	Synthetic (10D)	NN Tuning (12D)	Robot Push (14D)	Rover (60D)
Queries	500	300	10,000	20,000
RS	1.00	2.00	5.37	16.73
BO	$399.90^{*}$	$791.40^{*}$	$6202.55^{*}$	-
SMAC	747.4	200.14	-	-
TPE	16.50	6.00	67.78	-
BORE	37.10	17.20	514.37	1361.19
LFBO	66.70	23.88	575.38	2130.78
TuRBO	$20.5^{*}$	$56.08^{*}$	$496.09^{*}$	$2572.82^{*}$
PseudoBO-RP	42.03	32.90	127.55	513.08
PseudoBO-KR-Hyb	72.01	20.60	343.93	1453.00
PseudoBO-KR-Hyb-TR	-	-	107.22	510.60

Table 2: Runtime records in all types of tasks. Entries are in seconds. A * at the top right of an entry means that the corresponding model was run on a GPU; otherwise, only CPU was used. Without GPU acceleration, even TuRBO (one of the most scalable state-of-the-art BO variants) takes approximately 4 hours to complete one run for the 60D Rover task.

7 Conclusion and Future Works

In this paper, we have re-evaluated the core principle of exploration-based black-box optimization. Our study is motivated by an apparent mismatch between BO theory and practice, in the sense that practical BO algorithms often lack the theoretical guarantees offered by BO theory, which largely builds on GP-based procedures. To alleviate this mismatch, we propose a PseudoBO framework, which axiomatically dissects the minimal properties needed by the key ingredients in convergent black-box optimization algorithms. In particular, we show that if SP, UQ and AF satisfy their respective properties of local consistency, SNEB and improvement, then the assembled algorithm would eventually populate the decision space and consequently converge, and these properties can be further relaxed to achieve correspondingly relaxed consistency notions. Importantly, PseudoBO provides a recipe for algorithmic convergence beyond GP. That is, while GP is in the list of admissible ingredients, it is by no means the only option. From the recipe, we locate a combination of local regression as SP, a hybrid of randomized prior and minimum distance as UQ, and EI as AF, that gives rise to a consistent algorithm that performs competitively against existing state-of-the-art benchmarks across a wide range of tasks.

We view our work as a foundation for furthering the development of exploration-based black-box optimization that goes beyond GP and balances theory and practice. Immediate extensions include the adaptation of our framework to handle noisy evaluations. In this situation, we face both epistemic uncertainty (due to lack of knowledge of the black-box objective function) and aleatory uncertainty (due to the evaluation noise). Much like the deterministic case, there have been works on extending GP-based algorithms to handle such problems, and our PseudoBO approach that dissects the key algorithmic ingredients could be generalized to tackle the additional aleatory uncertainty. Next is grey-box optimization. This consists of objective functions that can be compositions of (multiple) black-box functions and analytically known functions. In this situation, we can individually model the SP and UQ for each underlying black-box function and suitably aggregate them via a single AF, and achieve similar algorithmic consistency as in the full black-box case. A more challenging next step is to study rate results, but this likely requires some opaque assumptions, as in GP-based methods. Finally, an important direction is to provide a more principled approach to constructing UQ, which is the core ingredient in inducing exploration, that can pair with a specific SP in some arguably optimal sense, and also to study the performance of such UQ relative to that of the GP posterior.

Acknowledgements

We gratefully acknowledge support from the Amazon CAIT Fellowship, as well as the InnoHK initiative, the Government of the HKSAR, and Laboratory for AI-Powered Financial Technologies.

References

[1] R. Astudillo and P. Frazier. Bayesian optimization of composite functions. In International Conference on Machine Learning, pages 354–363. PMLR, 2019.
[2] R. Astudillo and P. Frazier. Bayesian optimization of function networks. Advances in neural information processing systems, 34:14463–14475, 2021.
[3] Y. Bai, H. Lam, T. Balch, and S. Vyetrenko. Efficient calibration of multi-agent simulation models from output series with bayesian optimization. In Proceedings of the Third ACM International Conference on AI in Finance, pages 437–445, 2022.
[4] M. Balandat, B. Karrer, D. R. Jiang, S. Daulton, B. Letham, A. G. Wilson, and E. Bakshy. BoTorch: A Framework for Efficient Monte-Carlo Bayesian Optimization. In Advances in Neural Information Processing Systems 33, 2020.
[5] R. R. Barton, H. Lam, and E. Song. Input uncertainty in stochastic simulation. In The Palgrave Handbook of Operations Research, pages 573–620. Springer, 2022.
[6] J. Bect, F. Bachoc, and D. Ginsbourger. A supermartingale approach to gaussian process based sequential design of experiments. arXiv preprint arXiv:1608.01118, 2016.
[7] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. Advances in neural information processing systems, 24, 2011.
[8] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of machine learning research, 13(2), 2012.
[9] S. Bochner. Harmonic analysis and the theory of probability. Courier Corporation, 2005.
[10] E. Brochu, V. M. Cora, and N. De Freitas. A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.
[11] A. D. Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12(10), 2011.
[12] R. Calandra, A. Seyfarth, J. Peters, and M. P. Deisenroth. Bayesian optimization for learning gaits under uncertainty: An experimental comparison on a dynamic bipedal walker. Annals of Mathematics and Artificial Intelligence, 76:5–23, 2016.
[13] O. Chapelle and L. Li. An empirical evaluation of thompson sampling. Advances in neural information processing systems, 24, 2011.
[14] C. Chevalier and D. Ginsbourger. Fast computation of the multi-points expected improvement with applications in batch selection. In Learning and Intelligent Optimization: 7th International Conference, LION 7, Catania, Italy, January 7-11, 2013, Revised Selected Papers, pages 59–69. Springer, 2013.
[15] S. R. Chowdhury and A. Gopalan. No-regret algorithms for multi-task bayesian optimization. In International Conference on Artificial Intelligence and Statistics, pages 1873–1881. PMLR, 2021.
[16] C. G. Corlu, A. Akcay, and W. Xie. Stochastic simulation under input uncertainty: A review. Operations Research Perspectives, 7:100162, 2020.
[17] L. Csató and M. Opper. Sparse on-line gaussian processes. Neural computation, 14(3):641–668, 2002.
[18] D. Dheeru and E. Karra Taniskidou. UCI machine learning repository, 2017.
[19] B. Efron. The jackknife, the bootstrap and other resampling plans. SIAM, 1982.
[20] D. Eriksson, M. Pearce, J. Gardner, R. D. Turner, and M. Poloczek. Scalable global optimization via local bayesian optimization. Advances in neural information processing systems, 32, 2019.
[21] P. Frazier, W. Powell, and S. Dayanik. The knowledge-gradient policy for correlated normal beliefs. INFORMS journal on Computing, 21(4):599–613, 2009.
[22] P. I. Frazier. A tutorial on bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.
[23] J. Gardner, G. Pleiss, K. Q. Weinberger, D. Bindel, and A. G. Wilson. Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration. Advances in neural information processing systems, 31, 2018.
[24] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256. JMLR Workshop and Conference Proceedings, 2010.
[25] J. González, Z. Dai, P. Hennig, and N. Lawrence. Batch bayesian optimization via local penalization. In Artificial intelligence and statistics, pages 648–657. PMLR, 2016.
[26] L. Györfi, M. Köhler, A. Krzyżak, and H. Walk. A distribution-free theory of nonparametric regression, volume 1. Springer, 2002.
[27] A. Hebbal, L. Brevault, M. Balesdent, E.-G. Talbi, and N. Melab. Bayesian optimization using deep gaussian processes with applications to aerospace system design. Optimization and Engineering, 22:321–361, 2021.
[28] J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian processes for big data. arXiv preprint arXiv:1309.6835, 2013.
[29] F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In International conference on learning and intelligent optimization, pages 507–523. Springer, 2011.
[30] D. R. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of global optimization, 21:345–383, 2001.
[31] D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global optimization, 13(4):455, 1998.
[32] K. Kandasamy, J. Schneider, and B. Póczos. High dimensional bayesian optimisation and bandits via additive models. In International conference on machine learning, pages 295–304. PMLR, 2015.
[33] E. Kaufmann, N. Korda, and R. Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In Algorithmic Learning Theory: 23rd International Conference, ALT 2012, Lyon, France, October 29-31, 2012. Proceedings 23, pages 199–213. Springer, 2012.
[34] A. Klein and F. Hutter. Tabular benchmarks for joint architecture and hyperparameter optimization. arXiv preprint arXiv:1905.04970, 2019.
[35] N. Lawrence, M. Seeger, and R. Herbrich. Fast sparse gaussian process methods: The informative vector machine. Advances in neural information processing systems, 15, 2002.
[36] M. Lázaro-Gredilla, J. Quinonero-Candela, C. E. Rasmussen, and A. R. Figueiras-Vidal. Sparse spectrum gaussian process regression. The Journal of Machine Learning Research, 11:1865–1881, 2010.
[37] J. Li and I. O. Ryzhov. Convergence rates of epsilon-greedy global optimization under radial basis function interpolation. Stochastic Systems, 13(1):59–92, 2023.
[38] M. Lindauer, K. Eggensperger, M. Feurer, A. Biedenkapp, D. Deng, C. Benjamins, T. Ruhkopf, R. Sass, and F. Hutter. Smac3: A versatile bayesian optimization package for hyperparameter optimization. Journal of Machine Learning Research, 23(54):1–9, 2022.
[39] C. Liu and Y.-X. Wang. Global optimization with parametric function approximation. In International Conference on Machine Learning, pages 22113–22136. PMLR, 2023.
[40] M. Locatelli. Bayesian algorithms for one-dimensional global optimization. Journal of Global Optimization, 10(1):57, 1997.
[41] R. Marchant and F. Ramos. Bayesian optimisation for intelligent environmental monitoring. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pages 2242–2249. IEEE, 2012.
[42] R. Martinez-Cantin, N. de Freitas, A. Doucet, and J. A. Castellanos. Active policy learning for robot planning and exploration under uncertainty. In Robotics: Science and systems, volume 3, pages 321–328, 2007.
[43] M. McIntire, D. Ratner, and S. Ermon. Sparse gaussian processes for bayesian optimization. In UAI, 2016.
[44] S. Müller, M. Feurer, N. Hollmann, and F. Hutter. Pfns4bo: In-context learning for bayesian optimization. In International Conference on Machine Learning, pages 25444–25470. PMLR, 2023.
[45] S. Müller, N. Hollmann, S. P. Arango, J. Grabocka, and F. Hutter. Transformers can do bayesian inference. arXiv preprint arXiv:2112.10510, 2021.
[46] A. Nayebi, A. Munteanu, and M. Poloczek. A framework for bayesian optimization in embedded subspaces. In International Conference on Machine Learning, pages 4752–4761. PMLR, 2019.
[47] C. Oh, E. Gavves, and M. Welling. Bock: Bayesian optimization with cylindrical kernels. In International Conference on Machine Learning, pages 3868–3877. PMLR, 2018.
[48] I. Osband, J. Aslanides, and A. Cassirer. Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems, 31, 2018.
[49] M. A. Osborne, R. Garnett, and S. J. Roberts. Gaussian processes for global optimization. In 3rd international conference on learning and intelligent optimization (LION3), pages 1–15. Springer-Verlag Berlin, Germany, 2009.
[50] M. A. L. Pearce, M. Poloczek, and J. Branke. Bayesian optimization allowing for common random numbers. Operations Research, 70(6):3457–3472, 2022.
[51] V. Perrone, R. Jenatton, M. W. Seeger, and C. Archambeau. Scalable hyperparameter transfer learning. Advances in neural information processing systems, 31, 2018.
[52] H. Rakotoarison, S. Adriaensen, N. Mallik, S. Garibov, E. Bergman, and F. Hutter. In-context freeze-thaw bayesian optimization for hyperparameter optimization. arXiv preprint arXiv:2404.16795, 2024.
[53] D. Russo and B. Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
[54] J. Sacks, W. J. Welch, T. J. Mitchell, and H. P. Wynn. Design and analysis of computer experiments. Statistical Science, 4(4):409–423, 1989.
[55] M. W. Seeger, C. K. Williams, and N. D. Lawrence. Fast forward selection to speed up sparse gaussian process regression. In International Workshop on Artificial Intelligence and Statistics, pages 254–261. PMLR, 2003.
[56] D. Sha, K. Ozbay, and Y. Ding. Applying bayesian optimization for calibration of transportation simulation models. Transportation Research Record, 2674(10):215–228, 2020.
[57] A. Shah and Z. Ghahramani. Parallel predictive entropy search for batch global optimization of expensive objective functions. Advances in neural information processing systems, 28, 2015.
[58] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas. Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2015.
[59] E. Snelson and Z. Ghahramani. Sparse gaussian processes using pseudo-inputs. Advances in neural information processing systems, 18, 2005.
[60] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Prabhat, and R. Adams. Scalable bayesian optimization using deep neural networks. In International conference on machine learning, pages 2171–2180. PMLR, 2015.
[61] J. Song, L. Yu, W. Neiswanger, and S. Ermon. A general recipe for likelihood-free bayesian optimization. In International Conference on Machine Learning, pages 20384–20404. PMLR, 2022.
[62] J. T. Springenberg, A. Klein, S. Falkner, and F. Hutter. Bayesian optimization with robust bayesian neural networks. Advances in neural information processing systems, 29, 2016.
[63] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
[64] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.
[65] L. C. Tiao, A. Klein, M. W. Seeger, E. V. Bonilla, C. Archambeau, and F. Ramos. Bore: Bayesian optimization by density-ratio estimation. In International Conference on Machine Learning, pages 10289–10300. PMLR, 2021.
[66] S. Toscano-Palmerin and P. I. Frazier. Bayesian optimization with expensive integrands. arXiv preprint arXiv:1803.08661, 2018.
[67] J. Ungredda, M. Pearce, and J. Branke. Bayesian optimisation vs. input uncertainty reduction. ACM Transactions on Modeling and Computer Simulation (TOMACS), 32(3):1–26, 2022.
[68] E. Vazquez and J. Bect. Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. Journal of Statistical Planning and inference, 140(11):3088–3095, 2010.
[69] J. Wang, S. C. Clark, E. Liu, and P. I. Frazier. Parallel bayesian global optimization of expensive functions. Operations Research, 68(6):1850–1865, 2020.
[70] Z. Wang, C. Gehring, P. Kohli, and S. Jegelka. Batched large-scale bayesian optimization in high-dimensional spaces. In International Conference on Artificial Intelligence and Statistics, pages 745–754. PMLR, 2018.
[71] C. White, W. Neiswanger, and Y. Savani. Bananas: Bayesian optimization with neural architectures for neural architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 10293–10301, 2021.
[72] J. Wilson, V. Borovitskiy, A. Terenin, P. Mostowsky, and M. Deisenroth. Efficiently sampling functions from gaussian process posteriors. In International Conference on Machine Learning, pages 10292–10302. PMLR, 2020.
[73] J. Wu and P. Frazier. The parallel knowledge gradient method for batch bayesian optimization. Advances in neural information processing systems, 29, 2016.
[74] J. Xie, P. I. Frazier, and S. E. Chick. Bayesian optimization via simulation with pairwise sampling and correlated prior beliefs. Operations Research, 64(2):542–559, 2016.

Appendices

8 Proofs

Proof of Theorem 3.2.

We first prove part 1 of the theorem. Denote $\{x_{n}\}$ as the sequence of evaluation points chosen by PseudoBO, and correspondingly $\mathcal{D}_{n}=\{(x_{1},f(x_{1})),\ldots,(x_{n},f(x_{n}))\}$ is the data collection up to step $n$ . Since $\mathcal{X}$ is compact, $\{x_{n}\}$ has a convergent subsequence, say $\{x_{\pi(n)}\}$ . By Assumption 3.1 part 2, we have $W_{n}(x_{\pi(n)};\mathcal{D}_{\pi(n)-1})\to 0$ and hence $\liminf_{{n}\to\infty}W_{n}(x_{n};\mathcal{D}_{n-1})=0$ . On the other hand, suppose there is a point, say $x^{\prime}\in\mathcal{X}$ , that is non-adherent to $\{x_{n}\}$ , i.e., there is no subsequence in $\{x_{n}\}$ that converges to $x^{\prime}$ . Then, by Assumption 3.1 part 1, we have $\liminf_{{n}\to\infty}W_{n}(x^{\prime};\mathcal{D}_{n})>0$ . But this is impossible because PseudoBO requires at each step $x_{n+1}\in\text{argmax}_{x\in\mathcal{X}}W_{n}(x;\mathcal{D}_{n})$ , hence a contradiction.

Part 2 of the theorem follows by a straightforward use of the continuity of $f$ . More precisely, as $\mathcal{X}$ is compact and $f$ is continuous, $x^{*}\in\text{argmax}_{x\in\mathcal{X}}f(x)$ is well-defined and there is a subsequence of $\{x_{n}\}$ , say $\{x_{\pi^{\prime}(n)}\}$ , converging to $x^{*}$ by our first conclusion above. Then we have $f(x_{\pi^{\prime}(n)})\to f(x^{*})=Z^{*}$ by the continuity of $f$ . This gives

Z^{*}\geq f(\hat{x}_{n}^{*})\geq f(x_{\pi^{\prime}(n)})\to f(x^{*})=Z^{*},

which concludes $f(\hat{x}_{n}^{*})\to Z^{*}$ . ∎

Proof of Theorem 3.6.

To verify Assumption 3.1 part 1, consider an arbitrary finite-cardinality set sequence $S_{n}\subset\mathcal{X}$ and $A_{n}=E_{f}(S_{n})$. Suppose $\inf_{n}\Delta(x,S_{n})>0$. Then, by Assumption 3.4 part 1, we have $\liminf_{n\to\infty}\hat{\sigma}(x;A_{n})>0$. By Assumption 3.5 part 1, we must have $\liminf_{n\to\infty}W_{n}(x,A_{n})>0$.

To verify Assumption 3.1 part 2, consider a sequence $x_{n}\in\mathcal{X}$ that converges to $x^{\prime}$, and $A_{n}=E_{f}(S_{n})$ for an arbitrary finite-cardinality set sequence $S_{n}\subset\mathcal{X}$. By Assumption 3.4 part 2, we have $\hat{\sigma}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})\to 0$ where $\mathcal{D}_{n-1}=\{(x_{1},f(x_{1})),\ldots,(x_{n-1},f(x_{n-1}))\}$. Moreover, by Assumption 3.3 we have $\hat{f}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})\to f(x^{\prime})$. Thus,

\zeta(\hat{f}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})-\max\Pi_{f}(A_{n-1}\cup\mathcal{D}_{n-1}))\leq\zeta(\hat{f}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})-f(x_{n-1}))\to\zeta(0)\leq 0,

by the assumed continuity of $f$ and $\zeta$ , the non-decreasing property of $\zeta$ and $\zeta(0)\leq 0$ . Thus, by Assumption 3.5 part 2, we further have

W_{n}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})=g_{n}(\zeta(\hat{f}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})-\max\Pi_{f}(A_{n-1}\cup\mathcal{D}_{n-1})),\hat{\sigma}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1}))\to 0.

This concludes Assumption 3.1 part 2. ∎

Proof of Theorem 3.9.

We first prove part 1 of the theorem. Denote $\{x_{n}\}$ as the sequence of evaluation points chosen by PseudoBO, and correspondingly $\mathcal{D}_{n}=\{(x_{1},f(x_{1})),\ldots,(x_{n},f(x_{n}))\}$ is the data collection up to step $n$ . Since $\mathcal{X}$ is compact, $\{x_{n}\}$ has a convergent subsequence, say $\{x_{\pi(n)}\}$ . By Assumption 3.8 part 2, we have $W_{n}(x_{\pi(n)};\mathcal{D}_{\pi(n)-1})\to 0$ and hence $\liminf_{{n}\to\infty}W_{n}(x_{n};\mathcal{D}_{n-1})=0$ . On the other hand, suppose there is a point, say $x^{\prime}\in\mathcal{X}$ , that satisfies $\inf_{n}\Delta(x^{\prime},X_{n})>\delta$ . Then, by Assumption 3.8 part 1, we have $\liminf_{{n}\to\infty}W_{n}(x^{\prime};\mathcal{D}_{n})>0$ . But this is impossible because PseudoBO requires at each step $x_{n+1}\in\text{argmax}_{x\in\mathcal{X}}W_{n}(x;\mathcal{D}_{n})$ , hence a contradiction.

Now we prove part 2 of the theorem. First, as $\mathcal{X}$ is compact and $f$ is continuous, $x^{*}\in\text{argmax}_{x\in\mathcal{X}}f(x)$ is well-defined. By our first conclusion above, for any $\eta>0$ , we can find an $x_{n}$ such that $\|x_{n}-x^{*}\|\leq\delta+\eta$ . Then we have

f(\hat{x}_{n}^{*})\geq f(x_{n})\geq f(x^{*})-L(\delta+\eta),

by the Lipschitzness of $f$ . Since $\eta$ is arbitrary, we have

\liminf_{n}f(\hat{x}_{n}^{*})\geq f(x^{*})-L\delta=Z^{*}-L\delta

which concludes the theorem. ∎

Proof of Theorem 3.12.

To verify Assumption 3.8 part 1, consider an arbitrary finite-cardinality set sequence $S_{n}\subset\mathcal{X}$, and $A_{n}=E_{f}(S_{n})$. Suppose $\inf_{n}\Delta(x,S_{n})>\delta$. Then, by Assumption 3.11 part 1, we have $\liminf_{n\to\infty}\hat{\sigma}(x;A_{n})>0$. By Assumption 3.5 part 1, we must have $\liminf_{n\to\infty}W_{n}(x,A_{n})>0$.

To verify Assumption 3.8 part 2, consider a sequence $x_{n}\in\mathcal{X}$ that converges to $x^{\prime}$, and $A_{n}=E_{f}(S_{n})$ for an arbitrary finite-cardinality set sequence $S_{n}\subset\mathcal{X}$. By Assumption 3.11 part 2, we have $\hat{\sigma}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})\to 0$ where $\mathcal{D}_{n-1}=\{(x_{1},f(x_{1})),\ldots,(x_{n-1},f(x_{n-1}))\}$. Moreover, by Assumption 3.10 we have $\limsup_{n}\hat{f}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})\leq f(x^{\prime})+\epsilon$. Thus,

\zeta(\hat{f}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})-\max\Pi_{f}(A_{n-1}\cup\mathcal{D}_{n-1})-\epsilon)\leq\zeta(\hat{f}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})-f(x_{n-1})-\epsilon)

and so

\begin{aligned}
&\limsup_{n}\zeta(\hat{f}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})-\max\Pi_{f}(A_{n-1}\cup\mathcal{D}_{n-1})-\epsilon)\\
&\quad\leq\zeta(\limsup_{n}\{\hat{f}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})-f(x_{n-1})-\epsilon\})\\
&\quad\leq\zeta(f(x^{\prime})+\epsilon-f(x^{*})-\epsilon)\\
&\quad\leq\zeta(0)\\
&\quad\leq 0
\end{aligned}

by the assumed continuity of $f$ and $\zeta$ , the non-decreasing property of $\zeta$ and $\zeta(0)\leq 0$ . Thus, by Assumption 3.5 part 2, we further have

W_{n}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})=g_{n}(\zeta(\hat{f}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})-\max\Pi_{f}(A_{n-1}\cup\mathcal{D}_{n-1})-\epsilon),\hat{\sigma}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1}))\to 0.

This concludes Assumption 3.8 part 2. ∎

Proof of Proposition 4.1.

Proposition 10 in [68] assumes that $f\in\mathcal{H}$, where $\mathcal{H}$ denotes the RKHS of $\mathcal{K}$. Consider two sequences $\{x_{n}\}$, $\{y_{n}\}$ in the compact set $\mathcal{X}$. Suppose $y_{n}\rightarrow y^{\prime}$, where $y^{\prime}$ is adherent to $\{x_{n}\}$. This is condition (i) in Proposition 10 of [68]. A partial result of Proposition 10 shows that, under this condition, the GP mean predictor satisfies $\hat{f}_{GP}(y_{n};\mathcal{D}_{n})\rightarrow f(y^{\prime})$.

To further show that it satisfies our local consistency assumption, consider a sequence $x_{n}\in\mathcal{X}$ that converges to $x^{\prime}$, and $A_{n}=E_{f}(S_{n})$ for an arbitrary finite-cardinality set sequence $S_{n}\subset\mathcal{X}$. By definition, $x^{\prime}$ is an adherent point of $S_{n}\cup X_{n}$. Therefore, we have $\hat{f}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})\to f(x^{\prime})$. ∎

Proof of Proposition 4.2.

Suppose $\{x_{n}\}$ is a sequence in $\mathcal{X}$ that converges to $x^{\prime}$, and $A_{n}=E_{f}(S_{n})$ for an arbitrary finite-cardinality set sequence $S_{n}\subset\mathcal{X}$. Define $x^{\prime}_{n}:=\operatorname*{arg\,min}_{y\in X_{n-1}\cup S_{n-1}}\|x_{n}-y\|$ where $X_{n-1}=\{x_{1},\ldots,x_{n-1}\}$. We have $\hat{f}_{\text{NN}}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})=f(x^{\prime}_{n})$. Note that $x^{\prime}_{n}\to x^{\prime}$. Thus, $\hat{f}_{\text{NN}}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})=f(x^{\prime}_{n})\to f(x^{\prime})$ by the continuity of $f$. ∎

Proof of Proposition 4.3.

Suppose $\{x_{n}\}$ is a sequence in $\mathcal{X}$ that converges to $x^{\prime}$ , and $A_{n}=E_{f}(S_{n})$ for an arbitrary finite-cardinality set sequence $S_{n}\subset\mathcal{X}$ . We have

\begin{aligned}
&\|\hat{f}_{\text{Net}}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})-\hat{f}_{\text{Net}}(x_{n-1};A_{n-1}\cup\mathcal{D}_{n-1})\|\\
&\quad\leq L\left\lVert x_{n}-x_{n-1}\right\rVert\quad\text{since $\hat{f}_{\text{Net}}$ is $L$-Lipschitz}\\
&\quad\to 0\quad\text{by the convergence and hence Cauchy property of $\{x_{n}\}$}
\end{aligned}\qquad(17)

Thus,

\begin{aligned}
&\|\hat{f}_{\text{Net}}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})-f(x^{\prime})\|\\
&\quad\leq\|\hat{f}_{\text{Net}}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})-\hat{f}_{\text{Net}}(x_{n-1};A_{n-1}\cup\mathcal{D}_{n-1})\|+\|\hat{f}_{\text{Net}}(x_{n-1};A_{n-1}\cup\mathcal{D}_{n-1})-f(x^{\prime})\|\\
&\quad\to 0
\end{aligned}

by the triangle inequality, (17) and $\hat{f}_{\text{Net}}(x_{n-1};A_{n-1}\cup\mathcal{D}_{n-1})=f(x^{\prime})$ thanks to the over-parametrization. ∎

Proof of Proposition 4.4.

Suppose $x_{n}$ is a sequence in $\mathcal{X}$ that converges to $x^{\prime}$, and $A_{n}=E_{f}(S_{n})$ for an arbitrary finite-cardinality set sequence $S_{n}\subset\mathcal{X}$. Then there exists $N>0$ such that for all $n\geq N$, we have $\|x_{n}-x^{\prime}\|\leq\text{diam}(R_{l(x^{\prime},\theta)})$. For $n\geq N$, we have

\begin{aligned}
\|\hat{f}_{\text{tree}}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})-f(x^{\prime})\|&=\Big\|\sum_{i=1}^{n-1}\frac{\mathbbm{1}\{x_{i}\in R_{l(x_{n},\theta)}\}}{\sum_{j=1}^{n-1}\mathbbm{1}\{x_{j}\in R_{l(x_{n},\theta)}\}}f(x_{i})-f(x^{\prime})\Big\|\\
&=\Big\|\frac{1}{|\mathcal{I}_{n-1}(R_{l(x_{n},\theta)})|}\sum_{x_{i}\in\mathcal{I}_{n-1}(R_{l(x_{n},\theta)})}f(x_{i})-f(x^{\prime})\Big\|\\
&\leq\frac{1}{|\mathcal{I}_{n-1}(R_{l(x_{n},\theta)})|}\sum_{x_{i}\in\mathcal{I}_{n-1}(R_{l(x_{n},\theta)})}\|f(x_{i})-f(x^{\prime})\|\\
&\leq\frac{1}{|\mathcal{I}_{n-1}(R_{l(x_{n},\theta)})|}\sum_{x_{i}\in\mathcal{I}_{n-1}(R_{l(x_{n},\theta)})}L\|x_{i}-x^{\prime}\|\\
&\leq\frac{1}{|\mathcal{I}_{n-1}(R_{l(x_{n},\theta)})|}\sum_{x_{i}\in\mathcal{I}_{n-1}(R_{l(x_{n},\theta)})}L\cdot(\text{diam}(R_{l(x_{n},\theta)})+\text{diam}(R_{l(x^{\prime},\theta)}))\\
&\leq 2L\cdot\max_{x\in\mathcal{X}}\text{diam}(R_{l(x,\theta)}),
\end{aligned}

where $|\mathcal{I}_{n-1}(R_{l(x_{n},\theta)})|$ represents the cardinality of the set $\mathcal{I}_{n-1}(R_{l(x_{n},\theta)})$ comprising all points in $A_{n-1}\cup\mathcal{D}_{n-1}$ located inside the region $R_{l(x_{n},\theta)}$, the second inequality follows by the $L$-Lipschitz property, and the third inequality follows from the triangle inequality $\|x_{i}-x^{\prime}\|\leq\|x_{i}-x_{n}\|+\|x_{n}-x^{\prime}\|$ together with $\|x_{n}-x^{\prime}\|\leq\text{diam}(R_{l(x^{\prime},\theta)})$ for $n\geq N$. ∎

Proof of Proposition 4.5.

Suppose $x_{n}$ is a sequence in $\mathcal{X}$ that converges to $x^{\prime}$ , and $A_{n}=E_{f}(S_{n})$ for an arbitrary finite-cardinality set sequence $S_{n}\subset\mathcal{X}$ . Hence $x_{n}$ is Cauchy. So, for any $n$ sufficiently large, we have $\|x_{n}-x_{n-1}\|\leq Mh$ . For any such $n$ , we have

\begin{aligned}
\hat{f}_{\text{Ker}}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})&=\frac{\sum_{i\leq n-1}K\left(\frac{\|x_{n}-x_{i}\|}{h}\right)f(x_{i})}{\sum_{i\leq n-1}K\left(\frac{\|x_{n}-x_{i}\|}{h}\right)}\\
&=\frac{\sum_{i\leq n-1}K\left(\frac{\|x_{n}-x_{i}\|}{h}\right)(f(x_{n})+e_{n})}{\sum_{i\leq n-1}K\left(\frac{\|x_{n}-x_{i}\|}{h}\right)}
\end{aligned}

where $|e_{n}|\leq L\|x_{i}-x_{n}\|$ by the $L$ -Lipschitzness of $f$ . Thus, the above is equal to $f(x_{n})+\tilde{e}_{n}$ where $|\tilde{e}_{n}|\leq LMh$ since $K$ has support $[0,M]$ .

So, we have

\begin{aligned}
&\limsup_{n}\|\hat{f}_{\text{Ker}}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})-f(x^{\prime})\|\\
&\quad\leq\limsup_{n}\|\hat{f}_{\text{Ker}}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})-f(x_{n})\|+\lim_{n}\|f(x_{n})-f(x^{\prime})\|\\
&\quad\leq LMh
\end{aligned}

which concludes the proposition. ∎
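The $LMh$ bound can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: it assumes a one-dimensional domain, an Epanechnikov-type kernel supported on $[0,M]$ with $M=1$, and the test function $f(x)=|\sin 3x|$, which is $3$-Lipschitz. The names `compact_kernel`, `kernel_regression` and `fallback` are hypothetical.

```python
import numpy as np

def compact_kernel(u, M=1.0):
    """Epanechnikov-type kernel supported on [0, M] (an illustrative choice)."""
    u = np.asarray(u, dtype=float)
    return np.where(u < M, 1.0 - (u / M) ** 2, 0.0)

def kernel_regression(x, X, y, h, M=1.0, fallback=0.0):
    """Kernel-weighted average of y at x with bandwidth h.

    Returns `fallback` (a prefixed constant) when every kernel weight is
    zero, i.e. when no data point lies within distance M * h of x.
    """
    X = np.asarray(X, dtype=float).reshape(len(X), -1)  # shape (n, d)
    x = np.asarray(x, dtype=float).reshape(-1)          # shape (d,)
    weights = compact_kernel(np.linalg.norm(X - x, axis=1) / h, M)
    total = weights.sum()
    if total == 0.0:
        return fallback
    return float(weights @ np.asarray(y, dtype=float) / total)

# Assumed test problem: f is 3-Lipschitz, data on a regular grid in [0, 1].
L, M, h = 3.0, 1.0, 0.05
f = lambda x: np.abs(np.sin(3.0 * x))
X = np.linspace(0.0, 1.0, 50)
y = f(X)

queries = np.linspace(0.0, 1.0, 101)
max_error = max(abs(kernel_regression(x, X, y, h, M) - f(x)) for x in queries)
print(max_error <= L * M * h)
```

Only points within distance $Mh$ of the query receive positive weight, so each contributing $f(x_{i})$ differs from $f(x)$ by at most $LMh$, and so does their weighted average.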

Proof of Proposition 4.6.

Suppose $\{x_{n}\}$ converges to $x^{\prime}$ , and $A_{n}=E_{f}(S_{n})$ for an arbitrary finite-cardinality set sequence $S_{n}\subset\mathcal{X}$ . For each $r\sim\mathcal{R}$ , by $\epsilon$ -relaxed local consistency and the continuity of $r$ , we have

\begin{aligned}
\hat{f}^{(r)}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})&=r(x_{n})+\hat{f}(x_{n};A^{(r)}_{n-1}\cup\mathcal{D}^{(r)}_{n-1})\\
&=r(x^{\prime})+f(x^{\prime})-r(x^{\prime})+e_{n}=f(x^{\prime})+e_{n}
\end{aligned}\qquad(18)

where $\limsup_{n}|e_{n}|\leq\epsilon$ . Thus, since $f,\hat{f}^{(r)}$ are bounded, we have

\limsup_{n}|E_{r\sim\mathcal{R}}[\hat{f}^{(r)}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})]-f(x^{\prime})|\leq E_{r\sim\mathcal{R}}[\limsup_{n}|\hat{f}^{(r)}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})-f(x^{\prime})|]\leq\epsilon.

∎

Proof of Proposition 4.7.

Suppose $\{x_{n}\}$ converges to $x^{\prime}$ , and $A_{n}=E_{f}(S_{n})$ for an arbitrary finite-cardinality set sequence $S_{n}\subset\mathcal{X}$ . Given a finite class of $\epsilon$ -relaxed locally consistent SPs $\hat{\mathcal{F}}=\{\hat{f}_{i}\}$ , we have

\limsup_{n}\left|\sum_{i=1}^{|\hat{\mathcal{F}}|}\alpha_{i}\hat{f}_{i}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})-f(x^{\prime})\right|\leq\sum_{i=1}^{|\hat{\mathcal{F}}|}\alpha_{i}\limsup_{n}|\hat{f}_{i}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})-f(x^{\prime})|\leq\epsilon,

with the assumption that $\sum_{i=1}^{|\hat{\mathcal{F}}|}\alpha_{i}=1$ . ∎

Recall the NEB property of GP in [68]:

Assumption 8.1 (NEB property).

A GP has the NEB property if, for all sequences $\{x_{n}\}$ in $\mathcal{X}$ and all $x^{\prime}$ in $\mathcal{X}$, the following statements are equivalent:

1. $x^{\prime}$ is an adherent point of $\{x_{n}\}$, i.e., there is a subsequence in $\{x_{n}\}$ that converges to $x^{\prime}$.

2. $\hat{\sigma}_{\text{GP}}^{2}(x^{\prime};\mathcal{D}_{n})\to 0$ as $n\to\infty$.

Here $\hat{\sigma}_{\text{GP}}^{2}(x^{\prime};\mathcal{D}_{n})$ denotes the posterior variance of the GP with data $\mathcal{D}_{n}$ collected. We are ready to prove Proposition 4.8:

Proof of Proposition 4.8.

Proposition 10 in [68] verifies the NEB property of GP under the assumptions that $\mathcal{X}$ is compact, the GP is centered, the kernel function $\mathcal{K}$ of the GP is continuous and stationary, and its spectral density $\mathcal{S}$ is such that $\mathcal{S}^{-1}$ has at most polynomial growth. We show our SNEB property below based on this result. Moreover, it is clearly equivalent to show the required properties for the posterior variance $\hat{\sigma}_{\text{GP}}^{2}(x^{\prime};\mathcal{D}_{n})$ or for the posterior standard deviation $\hat{\sigma}_{\text{GP}}(x^{\prime};\mathcal{D}_{n})$.

To show part 1 of Assumption 3.4, suppose $\inf_{n}\Delta(x;S_{n})>0$ for an arbitrary finite-cardinality set sequence $S_{n}\subset\mathcal{X}$. This implies $\inf_{n}\Delta(x;\cup_{k\leq n}S_{k})>0$. Therefore, $x$ cannot be an adherent point of $\cup_{k\leq n}S_{k}$. The implication of Assumption 8.1 from statement 2 to statement 1 tells us $\hat{\sigma}_{\text{GP}}^{2}(x;\cup_{k\leq n}S_{k})\not\to 0$, and since $\hat{\sigma}_{\text{GP}}^{2}(x;\cup_{k\leq n}S_{k})\geq 0$ and is monotonically non-increasing in $n$ by the non-decreasing property of $\cup_{k\leq n}S_{k}$, we must have $\liminf_{n\to\infty}\hat{\sigma}_{\text{GP}}^{2}(x;S_{n})\geq\liminf_{n\to\infty}\hat{\sigma}_{\text{GP}}^{2}(x;\cup_{k\leq n}S_{k})>0$.

Denote the GP as $\xi(\cdot)$ . To show part 2 of Assumption 3.4, suppose $x_{n}$ converges to $x^{\prime}$ , and $A_{n}=E_{f}(S_{n})$ for an arbitrary finite-cardinality set sequence $S_{n}\subset\mathcal{X}$ . We have

\begin{aligned}
\hat{\sigma}_{\text{GP}}^{2}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})&\stackrel{\textnormal{(a)}}{\leq}\hat{\sigma}_{\text{GP}}^{2}(x_{n};\mathcal{D}_{n-1})\\
&\stackrel{\textnormal{(b)}}{=}\operatorname{Var}[\xi(x_{n})-\hat{f}_{\text{GP}}(x_{n};\mathcal{D}_{n-1})]\\
&\stackrel{\textnormal{(c)}}{\leq}\operatorname{Var}[\xi(x_{n})-\xi(x_{n-1})]\\
&\stackrel{\textnormal{(d)}}{=}\mathcal{K}(x_{n},x_{n})-2\mathcal{K}(x_{n},x_{n-1})+\mathcal{K}(x_{n-1},x_{n-1})\stackrel{\textnormal{(e)}}{\to}0,
\end{aligned}

where $\hat{f}_{\text{GP}}$ is the posterior mean of the GP. Inequality (a) follows from the non-increasing property of the posterior variance as more points are sampled; equality (b) follows by the definition of the posterior variance; inequality (c) follows from the fact that the posterior mean is the $\mathcal{L}^{2}$ best linear predictor; equality (d) follows by expanding out the variance; and the convergence (e) follows by the continuity of $\mathcal{K}$ and the convergence, and hence Cauchyness, of $x_{n}$.

∎

Proof of Proposition 4.9.

To verify part 1 of Assumption 3.4, suppose $\inf_{n}\Delta(x;S_{n}\cup X_{n})>0$. Then $\liminf_{n\to\infty}\Delta(x;S_{n}\cup X_{n})\geq\inf_{n}\Delta(x;S_{n}\cup X_{n})>0$.

To verify part 2 of Assumption 3.4, suppose $x_{n}\to x^{\prime}$ for some $x^{\prime}$ . We have $\Delta(x_{n};S_{n-1}\cup X_{n-1})\leq||x_{n}-x_{n-1}||\to 0$ since $x_{n}$ is Cauchy. ∎
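Both parts of this argument can be illustrated with a short NumPy sketch. It assumes that the UQ in this proposition is the minimum distance $\Delta(x;\cdot)$ to the evaluated points, and it uses an illustrative sequence $x_{n}=0.5+1/n$; the helper name `min_distance` is hypothetical.

```python
import numpy as np

def min_distance(x, points):
    """Delta(x; points): Euclidean distance from x to the nearest point."""
    points = np.asarray(points, dtype=float).reshape(len(points), -1)
    x = np.asarray(x, dtype=float).reshape(-1)
    return float(np.min(np.linalg.norm(points - x, axis=1)))

# Illustrative convergent sequence x_n = 0.5 + 1/n, which converges to 0.5.
xs = 0.5 + 1.0 / np.arange(1, 201)

# Part 2: the distance from x_n to the earlier points is at most
# |x_n - x_{n-1}|, which vanishes because the sequence is Cauchy.
gaps = [min_distance(xs[n], xs[:n]) for n in range(1, len(xs))]

# Part 1: a point that stays away from every x_n keeps a positive distance.
far = min_distance(0.0, xs)

print(gaps[-1], far)
```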

Proof of Proposition 4.10.

Part 1 of Assumption 3.11 follows directly from the assumption of the proposition. To verify part 2, suppose $\{x_{n}\}$ converges to $x^{\prime}$ , and $A_{n}=E_{f}(S_{n})$ for an arbitrary finite-cardinality set sequence $S_{n}\subset\mathcal{X}$ . From $\epsilon$ -relaxed local consistency of $\hat{f}^{(r)}$ and the continuity of $r$ , we have $\limsup_{n}|\hat{f}^{(r)}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})-f(x^{\prime})|\leq\epsilon$ for any $r$ almost surely by following the first part of the proof of Proposition 4.6. Then, since $\hat{f}^{(r)}$ is uniformly bounded, we have

\begin{aligned}
\limsup_{n}(\operatorname{Var}_{r\sim\mathcal{R}}[\hat{f}^{(r)}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})])^{\nicefrac{1}{2}}&\leq\limsup_{n}(E_{r\sim\mathcal{R}}[\hat{f}^{(r)}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})^{2}])^{\nicefrac{1}{2}}\\
&\leq(E_{r\sim\mathcal{R}}[\limsup_{n}\hat{f}^{(r)}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})^{2}])^{\nicefrac{1}{2}}\\
&\leq\epsilon
\end{aligned}

and so

((\operatorname{Var}_{r\sim\mathcal{R}}[\hat{f}^{(r)}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})])^{\nicefrac{1}{2}}-\epsilon)_{+}\to 0

∎

Proof of Corollary 4.11.

To apply Proposition 4.10, we verify the assumptions in Proposition 4.6 for the kernel regression SP, with $\epsilon=(L+\tilde{L})Mh$. To this end, note that with the assumptions in Proposition 4.5 and the additional assumptions in Corollary 4.11, $f-r$ is an $(L+\tilde{L})$-Lipschitz function. Thus, together with the continuity of $r$, $\hat{f}^{(r)}$ is $(L+\tilde{L})Mh$-relaxed locally consistent for any $r$ almost surely. Moreover, the uniform boundedness of $r$ and $f$ also implies the same property for $\hat{f}^{(r)}$ driven by the kernel regression SP. These verify all the assumptions in Proposition 4.6.

Next, note that when $\inf_{n}\Delta(x_{0};\mathcal{D}_{n})>Mh$, we have $\hat{f}_{\text{Ker}}(x_{0};\mathcal{D}_{n}^{(r)})$ equal to a prefixed constant by the construction of the kernel regression SP. In this case, $\hat{f}^{(r)}(x_{0};\mathcal{D}_{n})=r(x_{0})$. So, by assumption we have $(\operatorname{Var}_{r\sim\mathcal{R}}[\hat{f}^{(r)}(x_{0};\mathcal{D}_{n})])^{\nicefrac{1}{2}}=(\operatorname{Var}_{r\sim\mathcal{R}}[r(x_{0})])^{\nicefrac{1}{2}}>(L+\tilde{L})Mh$. This verifies the last assumption in Proposition 4.10 with $\delta=Mh$. Hence we conclude the corollary. ∎

Proof of Proposition 4.12.

Given a class of UQs $\hat{\Sigma}:=\{\hat{\sigma_{1}},\hat{\sigma_{2}},...\}$ with the $\delta$ -relaxed SNEB property, to show part 1 of Assumption 3.11, suppose $\inf_{n}\Delta(x;S_{n})>\delta$ , with $A_{n}=E_{f}(S_{n})$ for an arbitrary finite-cardinality set sequence $S_{n}\subset\mathcal{X}$ . For any convex combination of $\hat{\Sigma}$ ,

\liminf_{n\to\infty}\sum_{i=1}^{|\hat{\Sigma}|}\alpha_{i}\hat{\sigma}_{i}(x;A_{n-1}\cup\mathcal{D}_{n-1})\geq\sum_{i=1}^{|\hat{\Sigma}|}\alpha_{i}\liminf_{n\to\infty}\hat{\sigma}_{i}(x;A_{n-1}\cup\mathcal{D}_{n-1})>0,

where the first inequality is the superadditivity of $\liminf$, and the last inequality is inherited from part 1 of Assumption 3.11 for the individual UQs together with the fact that at least one of the $\alpha_{i}$'s is positive since they sum up to 1.

To show part 2 of Assumption 3.11, suppose $\{x_{n}\}$ converges to $x^{\prime}$ , and $A_{n}=E_{f}(S_{n})$ for an arbitrary finite-cardinality set sequence $S_{n}\subset\mathcal{X}$ . For any convex combination of $\hat{\Sigma}$ ,

\lim_{n\to\infty}\sum_{i=1}^{|\hat{\Sigma}|}\alpha_{i}\hat{\sigma}_{i}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})=\sum_{i=1}^{|\hat{\Sigma}|}\alpha_{i}\lim_{n\to\infty}\hat{\sigma}_{i}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})=0.

∎

Proof of Proposition 4.13.

To verify part 1 of Assumption 3.5, suppose $\liminf_{n\to\infty}p_{n}>-\infty$ and $\liminf_{n\to\infty}q_{n}>0$ . Then, for $n\geq N$ for some large $N$ , we have $p_{n}\geq c_{1}>-\infty$ and $q_{n}\geq c_{2}>0$ for some $c_{1}$ and $c_{2}$ , and so $g_{n}^{\text{PI}}(p_{n},q_{n})\geq\inf_{n\geq N}\Phi((c_{1}-\tau)/q_{n})$ which is at least $\Phi((c_{1}-\tau)/c_{2})$ if $c_{1}-\tau<0$ , and $1/2$ if $c_{1}-\tau\geq 0$ . Thus, $\liminf_{n\to\infty}g_{n}^{\text{PI}}(p_{n},q_{n})>0$ .

To verify part 2 of Assumption 3.5, suppose $\limsup_{n\to\infty}p_{n}\leq 0$ and $q_{n}\to 0$ . Then $\limsup_{n\to\infty}p_{n}-\tau<0$ since $\tau>0$ and thus eventually $p_{n}-\tau\leq c<0$ for some $c$ , giving $g_{n}^{\text{PI}}(p_{n},q_{n})\to 0$ by directly using (12). ∎

Proof of Proposition 4.14.

To verify part 1 of Assumption 3.5, suppose $\liminf_{n\to\infty}p_{n}>-\infty$ and $\liminf_{n\to\infty}q_{n}>0$. Then, for $n\geq N$ for a large enough $N$, we have $p_{n}\geq c_{1}>-\infty$ and $q_{n}\geq c_{2}>0$ for some $c_{1},c_{2}$. Note that $g_{n}^{\text{EI}}(p_{n},q_{n})=E(N(p_{n},q_{n})-\tau)_{+}$, where $N(p_{n},q_{n})$ denotes a normal variable with mean $p_{n}$ and standard deviation $q_{n}$. Suppose $n\geq N$. We have that $N(p_{n},q_{n})$ stochastically dominates $N(c_{1},q_{n})$, and thus $E(N(p_{n},q_{n})-\tau)_{+}\geq E(N(c_{1},q_{n})-\tau)_{+}$ since $(\cdot-\tau)_{+}$ is non-decreasing. Moreover, we have that $N(c_{1},q_{n})$ second-order stochastically dominates $N(c_{1},c_{2})$, and thus $E(N(c_{1},q_{n})-\tau)_{+}\geq E(N(c_{1},c_{2})-\tau)_{+}$ since $(\cdot-\tau)_{+}$ is non-decreasing and convex. Hence $g_{n}^{\text{EI}}(p_{n},q_{n})\geq g_{n}^{\text{EI}}(c_{1},c_{2})>0$. This gives $\liminf_{n\to\infty}g_{n}^{\text{EI}}(p_{n},q_{n})>0$.

To verify part 2 of Assumption 3.5, suppose $\limsup_{n\to\infty}p_{n}\leq 0$ and $q_{n}\to 0$ . Then $g_{n}^{\text{EI}}(p_{n},q_{n})\to 0$ by directly using (12) and noting that the function is continuous even at $q_{n}=0$ . ∎
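The two parts of this argument can be sanity-checked with the standard closed form $E(N(p,q)-\tau)_{+}=(p-\tau)\Phi(z)+q\phi(z)$, $z=(p-\tau)/q$. The sketch below is only an illustration: the function names are hypothetical, the numerical values of $c_{1}$, $c_{2}$ and $\tau$ are arbitrary, and whether this expression coincides term by term with (12) is an assumption.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def expected_improvement(p, q, tau):
    """E[(N(p, q) - tau)_+] for mean p and standard deviation q >= 0."""
    if q == 0.0:
        return max(p - tau, 0.0)  # continuous extension at q = 0
    z = (p - tau) / q
    return (p - tau) * normal_cdf(z) + q * normal_pdf(z)

tau = 0.1                    # arbitrary illustrative threshold
c1, c2 = -1.0, 0.5           # arbitrary lower bounds on p_n and q_n
lower = expected_improvement(c1, c2, tau)

# Part 1: any p >= c1 and q >= c2 gives a value at least `lower` > 0.
part1 = expected_improvement(0.3, 2.0, tau)

# Part 2: with p <= 0 and q shrinking to 0, the value vanishes.
part2 = [expected_improvement(0.0, 1.0 / n, tau) for n in (1, 10, 100, 1000)]

print(lower, part1, part2)
```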

Proof of Proposition 4.15.

Since $\tau$ is fixed, $\beta_{n}\to\infty$ and $p_{n}$ is bounded, $\frac{p_{n}-\tau}{\beta_{n}}\to 0$. To verify part 1 of Assumption 3.5, supposing $\liminf_{n\to\infty}q_{n}>0$, we have $\liminf_{n\to\infty}g_{n}^{\text{UCB}}(p_{n},q_{n})=\liminf_{n\to\infty}\{\frac{p_{n}-\tau}{\beta_{n}}+q_{n}\}=\liminf_{n\to\infty}q_{n}>0$. To verify part 2, supposing $q_{n}\to 0$, we have $g_{n}^{\text{UCB}}(p_{n},q_{n})=\frac{p_{n}-\tau}{\beta_{n}}+q_{n}\to 0$. ∎

Proof of Proposition 4.16.

Both parts of Assumption 3.5 can be straightforwardly shown to be preserved under convex combinations. ∎

9 Additional Experiment Details

9.1 Implementation Details

The methods we compare with are listed in Table 3.

Methods	Software library	URL
RS	Hyperopt	https://github.com/hyperopt/hyperopt
BO	BoTorch	https://botorch.org
TuRBO	-	https://github.com/uber-research/TuRBO
TPE	Hyperopt	https://github.com/hyperopt/hyperopt
BORE	Syne Tune	https://github.com/awslabs/syne-tune
LFBO	-	https://github.com/lfbo-ml/lfbo

Table 3: Package information.

In particular, for standard BO, we use the Matérn kernel with the default hyperparameter values in BoTorch. For BORE and LFBO, we use XGBoost as the classifier with the preset parameters of the original implementations, since LFBO and BORE with XGBoost perform relatively well and stably across tasks and are computationally much faster than with other classifiers (e.g., random forest or neural network).

All experiments are conducted on a computer with a 4-core Intel(R) Xeon(R) CPU @ 2.30GHz and a Tesla T4 GPU.

9.1.1 Hyperparameters for PseudoBO Methods

Randomized prior functions. The random functions in all tasks are sampled from a random 3-layer neural network $r(x)=W_{3}\text{Tanh}(W_{2}\text{Tanh}(W_{1}x+b_{1})+b_{2})+b_{3}$ , with the Glorot random initialization [24].
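A minimal NumPy sketch of sampling one such prior function is given below. It is an illustration under stated assumptions rather than the authors' code: the hidden width of 32, the uniform variant of the Glorot initialization, the zero biases, and the names `glorot_uniform` and `sample_prior_function` are all hypothetical choices not specified in the text.

```python
import numpy as np

def glorot_uniform(rng, fan_in, fan_out):
    """Glorot (Xavier) uniform initialization for a (fan_out, fan_in) matrix."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def sample_prior_function(dim, width=32, seed=None):
    """Draw r(x) = W3 tanh(W2 tanh(W1 x + b1) + b2) + b3 with random weights.

    The width and the zero biases are assumptions made for this sketch.
    """
    rng = np.random.default_rng(seed)
    W1 = glorot_uniform(rng, dim, width)
    W2 = glorot_uniform(rng, width, width)
    W3 = glorot_uniform(rng, width, 1)
    b1, b2, b3 = np.zeros(width), np.zeros(width), np.zeros(1)

    def r(x):
        x = np.atleast_2d(np.asarray(x, dtype=float))  # shape (n, dim)
        hidden = np.tanh(x @ W1.T + b1)
        hidden = np.tanh(hidden @ W2.T + b2)
        return (hidden @ W3.T + b3)[:, 0]              # shape (n,)

    return r

r = sample_prior_function(dim=2, seed=0)
points = np.random.default_rng(1).uniform(size=(5, 2))
values = r(points)
print(values.shape)
```

Because the Tanh outputs lie in $[-1,1]$ and the weights are bounded, each sampled $r$ is continuous and uniformly bounded.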

Kernel. We use the Gaussian kernel throughout all models and all tasks.

Bandwidths. For local kernel regression, we configure the bandwidth as $h_{n}(x)=(1-e^{-\Delta(x,X_{n})\cdot n})\cdot(h_{n}^{(u)}-h_{n}^{(l)})+h_{n}^{(l)}$, where $h_{n}^{(l)}=h_{0}^{(l)}\cdot n^{-\frac{1}{2+d}}$ and $h_{n}^{(u)}=h_{0}^{(u)}\cdot n^{-\frac{1}{2+d}}$ for constants $h_{0}^{(u)}\geq h_{0}^{(l)}>0$. For the randomized prior method, we adopt $h_{n}^{\prime}=h_{0}^{\prime}\cdot n^{-\frac{1}{2+d}}$. From our empirical investigation, choosing $h_{0}^{(l)}$, $h_{0}^{(u)}$ and $h_{0}^{\prime}$ to be small helps the performance of the PseudoBO models. Moreover, note that the selected bandwidths depend on $n$. This dependence is mild: our theory for the $(\epsilon,\delta)$-relaxed algorithmic consistency guarantee recommends a fixed bandwidth, but we adopt this light dependence on $n$ to match suggestions from standard minimax results in nonparametric statistics [26]. Typically, for $\mathcal{X}=[0,1]^{d}$, choosing $h_{0}^{(l)}\in[0.075,0.1]$, $h_{0}^{(u)}\in[0.2,0.4]$, and $h_{0}^{\prime}=0.005$ (in hybrid randomized prior UQ) performs relatively well.
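The local bandwidth rule can be sketched in a few lines of NumPy. This is a hedged illustration, not the authors' code: the function name `local_bandwidth` and the default values of `h0_lower` and `h0_upper` (taken from the ranges quoted above) are assumptions.

```python
import numpy as np

def local_bandwidth(x, X, h0_lower=0.075, h0_upper=0.2):
    """h_n(x) = (1 - exp(-Delta(x, X_n) * n)) * (h_u - h_l) + h_l.

    X holds the n evaluated points as rows; Delta is the distance from x
    to its nearest evaluated point.
    """
    X = np.asarray(X, dtype=float).reshape(len(X), -1)  # shape (n, d)
    x = np.asarray(x, dtype=float).reshape(-1)          # shape (d,)
    n, d = X.shape
    decay = n ** (-1.0 / (2.0 + d))
    h_lower, h_upper = h0_lower * decay, h0_upper * decay
    delta = np.min(np.linalg.norm(X - x, axis=1))
    return float((1.0 - np.exp(-delta * n)) * (h_upper - h_lower) + h_lower)

X = np.linspace(0.0, 1.0, 10).reshape(-1, 1)   # 10 evaluated points in [0, 1]
h_at_data = local_bandwidth(X[0], X)           # delta = 0: lower bandwidth
h_far = local_bandwidth([5.0], X)              # large delta: near upper bandwidth
print(h_at_data, h_far)
```

The bandwidth therefore equals $h_{n}^{(l)}$ at an evaluated point and grows toward $h_{n}^{(u)}$ in unexplored regions.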

Specifically, PseudoBO-RP uses $h_{0,i}^{\prime}=0.075(\mathcal{X}_{i,1}-\mathcal{X}_{i,0})$ for synthetic function optimization; $h_{0}^{\prime}=0.1\times[1/2,1/2,1/2,1/2,1/4,1/3,1/3,1/6,1/2,1/2,1/6,1/6]$ for neural network tuning, where the denominators in the vector represent the number of candidates in each hyperparameter (see Section 9.3 for further details); $h_{0,i}^{\prime}=0.1(\mathcal{X}_{i,1}-\mathcal{X}_{i,0})$ for both RL tasks.

PseudoBO-KR-Hyb/PseudoBO-KR-Hyb-TR uses $h_{0}^{\prime}=0.01\times[1/2,1/2,1/2,1/2,1/4,1/3,1/3,1/6,1/2,1/2,1/6,1/6]$ in neural network tuning, and $h_{0,i}^{\prime}=0.005(\mathcal{X}_{i,1}-\mathcal{X}_{i,0})$ in all other tasks for UQ. For SP, they use $h_{0,i}^{(l)}=0.05(\mathcal{X}_{i,1}-\mathcal{X}_{i,0})$, $h_{0,i}^{(u)}=0.2(\mathcal{X}_{i,1}-\mathcal{X}_{i,0})$ in synthetic function optimization; $h_{0}^{(l)}=0.2\times[1/2,1/2,1/2,1/2,1/4,1/3,1/3,1/6,1/2,1/2,1/6,1/6]$, $h_{0}^{(u)}=0.8\times[1/2,1/2,1/2,1/2,1/4,1/3,1/3,1/6,1/2,1/2,1/6,1/6]$ in neural network tuning; $h_{0,i}^{(l)}=0.1(\mathcal{X}_{i,1}-\mathcal{X}_{i,0})$, $h_{0,i}^{(u)}=0.325(\mathcal{X}_{i,1}-\mathcal{X}_{i,0})$ in robot pushing; $h_{0,i}^{(l)}=0.1(\mathcal{X}_{i,1}-\mathcal{X}_{i,0})$, $h_{0,i}^{(u)}=0.375(\mathcal{X}_{i,1}-\mathcal{X}_{i,0})$ in rover trajectory planning.

Perturbation probability of Sobol sequence. As in TuRBO [20], we use the Sobol sequence for inner optimization of all PseudoBO methods, with perturbing probability of $1$ in the 2D synthetic function tasks, $0.75$ in the 6D synthetic function task, $0.5$ in the 10D synthetic function task, $0.4$ in the 12D neural network tuning task, $0.35$ in the 14D robot push task, and $0.15$ in the 60D rover trajectory task.

Robust UQ. In the task of neural network hyperparameter tuning, we observe that the collected query data has large variability, resulting in poor UQ estimates. Therefore, we winsorize the data so that the UQ works robustly. In particular, all objective values smaller than $q_{3}-5(q_{3}-q_{1})$ are replaced with $q_{3}-5(q_{3}-q_{1})$, where $q_{1}$ and $q_{3}$ represent the first and third quartiles of the data.
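The winsorization step can be sketched as follows. This is a minimal illustration assuming that quartiles are computed with NumPy's default linear interpolation; the function name `winsorize_lower`, the parameter `k` and the sample values are hypothetical.

```python
import numpy as np

def winsorize_lower(values, k=5.0):
    """Replace values below q3 - k * (q3 - q1) with that floor."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    floor = q3 - k * (q3 - q1)
    return np.maximum(values, floor)

# Hypothetical objective values with one extreme low outlier.
objectives = np.array([0.90, 0.91, 0.92, 0.93, 0.94, -10.0])
clipped = winsorize_lower(objectives)
print(clipped)
```

Only the lower tail is clipped, which matches a maximization setting where rare, very poor evaluations would otherwise dominate the spread of the data.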

9.2 Calibrated Coverage Rate

We employ our proposed CCR criterion to assess how well the UQs of the considered methods, GP, NN+MD, RP and KR+Hyb, are calibrated. We generate the training set $\mathcal{D}_{\text{train}}$, the validation set $\mathcal{D}_{\text{val}}$ and the test set $\mathcal{D}_{\text{test}}$ by uniformly sampling $20$, $10$ and $150$ points from the decision space at random, with their labels evaluated by the black-box function.

We have purposefully designed the sizes of $\mathcal{D}_{\text{train}}$ , $\mathcal{D}_{\text{val}}$ , and $\mathcal{D}_{\text{test}}$ . The size of the validation set is smaller than the size of training set so that the calibration over the validation set does not disclose excessive information about the unknown function’s shape. Consequently, the quality of calibration depends on a combination of learning from the training set and the supplementary information from the validation set. Moreover, the size of test set is much larger than both for the purpose of a more accurate evaluation of the true performance of each UQ method.

To find $\lambda_{\text{val}}=\min_{\lambda\geq 0}\lambda$ such that

\mathds{P}_{(x,y)\sim\mathcal{D}_{\text{val}}}(y\in[\hat{f}(x;\mathcal{D}_{\text{train}})-\lambda\hat{\sigma}(x;\mathcal{D}_{\text{train}}),\hat{f}(x;\mathcal{D}_{\text{train}})+\lambda\hat{\sigma}(x;\mathcal{D}_{\text{train}})])=1,

we use the bisection approach in Algorithm 1:

Input: Pretrained SP $\hat{f}(\cdot;\mathcal{D}_{\text{train}})$, pretrained UQ $\hat{\sigma}(\cdot;\mathcal{D}_{\text{train}})$, validation set $\mathcal{D}_{\text{val}}$, and tolerance level $\epsilon$;

Output: Multiplier $\lambda_{\text{val}}$;

Initialize $\lambda_{l}\leftarrow 0$, $\lambda_{\text{init}}\leftarrow 1$, $\lambda_{u}\leftarrow\infty$

while $\lambda_{u}=\infty$ do
    Compute $CR_{\text{init}}\leftarrow\mathds{P}_{(x,y)\sim\mathcal{D}_{\text{val}}}(y\in[\hat{f}(x;\mathcal{D}_{\text{train}})-\lambda_{\text{init}}\hat{\sigma}(x;\mathcal{D}_{\text{train}}),\hat{f}(x;\mathcal{D}_{\text{train}})+\lambda_{\text{init}}\hat{\sigma}(x;\mathcal{D}_{\text{train}})])$
    if $CR_{\text{init}}<1$ then
        $\lambda_{\text{init}}\leftarrow 2\cdot\lambda_{\text{init}}$
    else
        $\lambda_{u}\leftarrow\lambda_{\text{init}}$
    end if
end while

while $\lambda_{u}-\lambda_{l}>\epsilon$ do
    $\lambda_{\text{val}}\leftarrow(\lambda_{l}+\lambda_{u})/2$
    Compute $CR_{\text{val}}\leftarrow\mathds{P}_{(x,y)\sim\mathcal{D}_{\text{val}}}(y\in[\hat{f}(x;\mathcal{D}_{\text{train}})-\lambda_{\text{val}}\hat{\sigma}(x;\mathcal{D}_{\text{train}}),\hat{f}(x;\mathcal{D}_{\text{train}})+\lambda_{\text{val}}\hat{\sigma}(x;\mathcal{D}_{\text{train}})])$
    if $CR_{\text{val}}<1$ then
        $\lambda_{l}\leftarrow\lambda_{\text{val}}$
    else
        $\lambda_{u}\leftarrow\lambda_{\text{val}}$
    end if
end while

Algorithm 1 Pre-trained SP and UQ Combo Calibration
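The bisection in Algorithm 1 can be written as a short Python sketch. It is an illustration under assumptions, not the authors' implementation: the SP and UQ are represented by their precomputed predictions on the validation set (`mean_val`, `std_val`), the function names are hypothetical, a cap `max_doublings` is added to guard against a zero UQ value, and the upper bracket is returned instead of the last midpoint so that the reported multiplier always attains full validation coverage.

```python
import numpy as np

def coverage_rate(y, mean, std, lam):
    """Fraction of labels y inside [mean - lam * std, mean + lam * std]."""
    return float(np.mean(np.abs(y - mean) <= lam * std))

def calibrate_multiplier(y_val, mean_val, std_val, tol=1e-3, max_doublings=60):
    """Smallest multiplier (up to tol) giving full coverage on the validation set."""
    y_val, mean_val, std_val = (np.asarray(a, dtype=float)
                                for a in (y_val, mean_val, std_val))
    lam_lower, lam_init = 0.0, 1.0
    # Phase 1: double lam_init until every validation label is covered.
    for _ in range(max_doublings):
        if coverage_rate(y_val, mean_val, std_val, lam_init) >= 1.0:
            break
        lam_init *= 2.0
    else:
        raise ValueError("full coverage not reached; a UQ value may be zero")
    lam_upper = lam_init
    # Phase 2: bisect between the uncovered and covered multipliers.
    while lam_upper - lam_lower > tol:
        lam_mid = 0.5 * (lam_lower + lam_upper)
        if coverage_rate(y_val, mean_val, std_val, lam_mid) < 1.0:
            lam_lower = lam_mid
        else:
            lam_upper = lam_mid
    return lam_upper

# Hypothetical validation labels, SP predictions and UQ values.
y_val = [1.0, 2.0, 3.0]
mean_val = [1.1, 1.5, 3.0]
std_val = [0.1, 0.2, 0.5]
lam = calibrate_multiplier(y_val, mean_val, std_val)
print(lam)
```

In this toy example the binding point has $|y-\hat{f}|/\hat{\sigma}=2.5$, so the returned multiplier lies within the tolerance of $2.5$.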

Additional sample runs on the objectives $f_{1}$ and $f_{2}$ are shown in Figs. 9 and 10.

9.3 Hyperparameter Tuning

This task originates from HPOBench [34]. The parameters to be tuned are shown in Table 4.

Hyperparameter	Choices
Initial LR	$\{0.0005,0.001,0.005,0.01,0.05,0.1\}$
Batch Size	$\{8,16,32,64\}$
LR Schedule	$\{$ cosine, fixed $\}$
Activation of Layer 1	$\{$ Tanh, ReLU $\}$
Width of Layer 1	$\{16,32,64,128,256,512\}$
Dropout rate of Layer 1	$\{0.0,0.3,0.6\}$
Activation of Layer 2	$\{$ Tanh, ReLU $\}$
Width of Layer 2	$\{16,32,64,128,256,512\}$
Dropout rate of Layer 2	$\{0.0,0.3,0.6\}$

Table 4: Tunable hyperparameters and search space in the HPOBench task.

In this task, we follow the procedure in [65] to densify the search space. Each categorical hyperparameter (LR Schedule and the activation functions of layers $1$ and $2$) is one-hot encoded, with one variable with domain [0,1] per category. Each of the other hyperparameters is represented by a single variable with domain [0,1], with its candidates equally spaced over that domain. Therefore, the search domain is $[0,1]^{12}$: three categorical hyperparameters with two categories each, plus six other hyperparameters.
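One way to realize such an encoding is sketched below. The decoding rules are assumptions made for illustration (an arg-max over each one-hot block and equal-width bins for the remaining variables), as are the variable order and the dictionary keys; the exact procedure of [65] may differ.

```python
import numpy as np

# Search space from Table 4; key names and ordering are hypothetical.
CATEGORICAL = {
    "lr_schedule": ["cosine", "fixed"],
    "activation_1": ["tanh", "relu"],
    "activation_2": ["tanh", "relu"],
}
ORDINAL = {
    "init_lr": [0.0005, 0.001, 0.005, 0.01, 0.05, 0.1],
    "batch_size": [8, 16, 32, 64],
    "width_1": [16, 32, 64, 128, 256, 512],
    "dropout_1": [0.0, 0.3, 0.6],
    "width_2": [16, 32, 64, 128, 256, 512],
    "dropout_2": [0.0, 0.3, 0.6],
}
DIMENSION = sum(len(c) for c in CATEGORICAL.values()) + len(ORDINAL)

def decode(u):
    """Map a point u in [0, 1]^DIMENSION to a hyperparameter configuration."""
    u = np.asarray(u, dtype=float)
    config, pos = {}, 0
    for name, choices in CATEGORICAL.items():
        block = u[pos:pos + len(choices)]          # one variable per category
        config[name] = choices[int(np.argmax(block))]
        pos += len(choices)
    for name, choices in ORDINAL.items():
        k = len(choices)
        index = min(int(u[pos] * k), k - 1)        # k equal-width bins on [0, 1]
        config[name] = choices[index]
        pos += 1
    return config

print(DIMENSION)
print(decode(np.full(DIMENSION, 0.5)))
```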

9.4 Robot Arm Pushing

This 14D task is designed for optimizing the controllers of the robot’s two arms to push two objects to their target places $o_{1}$ and $o_{2}$, starting from positions $s_{1}$ and $s_{2}$. Let us denote the ending positions of the two objects by $e_{1}$ and $e_{2}$, resulting from a specific control. The final reward is defined as

R(x)=\left\lVert s_{1}-o_{1}\right\rVert+\left\lVert s_{2}-o_{2}\right\rVert-(\left\lVert o_{1}-e_{1}\right\rVert+\left\lVert o_{2}-e_{2}\right\rVert).

There are $14$ parameters to control the location and rotation of the robot hands, pushing speed, moving direction and pushing time, presented in Table 5.

Hyperparameter of Arm 1/2	Ranges
Position $x$	$[-5,5]$
Position $y$	$[-5,5]$
Angle	$[0,2\pi]$
Torque	$[-5,5]$
Velocity $v_{x}$	$[-10,10]$
Velocity $v_{y}$	$[-10,10]$
Push duration	$[2,30]$

Table 5: Tunable hyperparameters and search space in the robot arm pushing task.

9.5 Rover Trajectory Planning

This task is a $60$ D problem, where our target is to optimize the trajectory of the rover, determined by our choices of $30$ points, in a 2D plane. The reward is estimated in the following way:

f(x)=c(x)+\lambda(\left\lVert x_{0,1}-s\right\rVert_{1}+\left\lVert x_{59,60}-% o\right\rVert_{1})+b,

where $s$ and $o$ are the starting and target positions, $x\in[0,1]^{60}$ contains the coordinates of the chosen points, and $c(x)$ measures the cost of the trajectory determined by $x$ .
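To make the structure of this reward concrete, the following sketch evaluates $f(x)$ for a user-supplied trajectory cost. It is only an illustration: the function name `rover_reward`, the placeholder cost `zero_cost`, and the numerical values of $\lambda$, $b$, $s$ and $o$ in the usage lines are hypothetical, since they are not specified in this section. Reading $x$ as $30$ consecutive coordinate pairs is also an assumption, chosen to match the indices $x_{0,1}$ and $x_{59,60}$.

```python
import numpy as np


def rover_reward(x, cost_fn, s, o, lam, b):
    """Evaluate f(x) = c(x) + lam * (||first - s||_1 + ||last - o||_1) + b.

    x is a length-60 vector read as 30 consecutive (x, y) points (assumed layout);
    cost_fn is a placeholder for the trajectory cost c(x).
    """
    pts = np.asarray(x, dtype=float).reshape(30, 2)
    s = np.asarray(s, dtype=float)
    o = np.asarray(o, dtype=float)
    # l1 distances of the first point to the start and of the last point to the target
    endpoint_term = np.abs(pts[0] - s).sum() + np.abs(pts[-1] - o).sum()
    return cost_fn(pts) + lam * endpoint_term + b


def zero_cost(pts):
    """Hypothetical stand-in for c(x); the benchmark's cost is not reproduced here."""
    return 0.0


# Illustrative values only: lam, b, s and o are not taken from the paper.
x = np.full(60, 0.5)
value = rover_reward(x, zero_cost, s=(0.0, 0.0), o=(1.0, 1.0), lam=-1.0, b=0.0)
print(value)
```

With a negative $\lambda$, as in the usage lines, the endpoint term acts as a penalty on trajectories whose first and last points lie far from $s$ and $o$.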

Instructions for running these two tasks can be found in https://github.com/uber-research/TuRBO and https://github.com/zi-w/Ensemble-Bayesian-Optimization.

$$
\begin{aligned}
\|\hat{f}_{\text{tree}}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})-f(x^{\prime})\|
&=\Big\|\sum_{i=1}^{n-1}\frac{\mathbbm{1}\{x_{i}\in R_{l(x_{n},\theta)}\}}{\sum_{j=1}^{n-1}\mathbbm{1}\{x_{j}\in R_{l(x_{n},\theta)}\}}f(x_{i})-f(x^{\prime})\Big\|\\
&=\Big\|\frac{1}{\|\mathcal{I}_{n-1}(R_{l(x_{n},\theta)})\|}\sum_{x_{i}\in\mathcal{I}_{n-1}(R_{l(x_{n},\theta)})}f(x_{i})-f(x^{\prime})\Big\|\\
&\leq\frac{1}{\|\mathcal{I}_{n-1}(R_{l(x_{n},\theta)})\|}\sum_{x_{i}\in\mathcal{I}_{n-1}(R_{l(x_{n},\theta)})}\|f(x_{i})-f(x^{\prime})\|\\
&\leq\frac{1}{\|\mathcal{I}_{n-1}(R_{l(x_{n},\theta)})\|}\sum_{x_{i}\in\mathcal{I}_{n-1}(R_{l(x_{n},\theta)})}L\|x_{i}-x^{\prime}\|\\
&\leq\frac{1}{\|\mathcal{I}_{n-1}(R_{l(x_{n},\theta)})\|}\sum_{x_{i}\in\mathcal{I}_{n-1}(R_{l(x_{n},\theta)})}L\cdot\big(\text{diam}(R_{l(x_{n},\theta)})+\text{diam}(R_{l(x^{\prime},\theta)})\big)\\
&\leq 2L\cdot\max_{x\in\mathcal{X}}\text{diam}(R_{l(x,\theta)}).
\end{aligned}
$$
