Pseudo-Bayesian Optimization

Haoxian Chen,1 Henry Lam1

1 Department of Industrial Engineering and Operations Research, Columbia University

Abstract

Bayesian Optimization is a popular approach for optimizing expensive black-box functions. Its key idea is to use a surrogate model to approximate the objective and, importantly, quantify the associated uncertainty that allows a sequential search of query points that balance exploitation-exploration. Gaussian process (GP) has been a primary candidate for the surrogate model, thanks to its Bayesian-principled uncertainty quantification power and modeling flexibility. However, its challenges have also spurred an array of alternatives whose convergence properties could be more opaque. Motivated by these, we study in this paper an axiomatic framework that elicits the minimal requirements to guarantee black-box optimization convergence that could apply beyond GP-based methods. Moreover, we leverage the design freedom in our framework, which we call Pseudo-Bayesian Optimization, to construct empirically superior algorithms. In particular, we show how using simple local regression, and a suitable “randomized prior” construction to quantify uncertainty, not only guarantees convergence but also consistently outperforms state-of-the-art benchmarks in examples ranging from high-dimensional synthetic experiments to realistic hyperparameter tuning and robotic applications.

1 Introduction

Bayesian Optimization (BO) is a popular method for the global optimization of black-box functions that are typically multimodal and expensive to evaluate. Its main idea is to model the objective landscape using observed data in order to “guess” where the global optimum is, rather than following a solution trajectory. One of the earliest works in BO is [31] in 1998, initially proposed for automotive and semiconductor designs and motivated by long runtimes of their computer codes. Nowadays, BO has been widely applied in machine learning including hyper-parameter tuning and Auto-ML [38], reinforcement learning [12], robotics [42], experimental design such as A/B testing [13], simulator calibration [56, 3], and engineering applications such as environmental monitoring [41] and aerospace system design [27]. Extensive overviews are in [10, 58, 22].

More precisely, a typical BO procedure sequentially searches for next design points to evaluate by updating a surrogate model to predict the objective and, moreover, quantify the associated uncertainty. Both the prediction and the uncertainty quantification are important as they work together, via a so-called acquisition function, to balance exploitation-exploration in the evaluation sequence. In this regard, Gaussian process (GP) has been a primary candidate of surrogate model, thanks to its Bayesian-principled uncertainty quantification power and modeling flexibility [30, 49]. However, it has a cubic-order scalability due to the inversion of the Gram kernel matrix in computing the posterior and, despite its principled design, it is conceivable to construct algorithms that could leverage problem structures more efficiently. To this end, there has been a surge of active studies to improve BO. These include the sparsification of GP using pseudo-inputs [55, 35, 59, 43, 28, 17] or kernel approximation [36] to reduce its time complexity, substitution of GP with other surrogates such as random forests [29] and neural networks [62, 60, 51, 71], and most recently the direct modeling of acquisition functions over the search space instead of through a surrogate [7, 65, 61].

Most of the state-of-the-art procedures mentioned above (and expanded further in Section 2) are practically motivated. In other words, they are shown to be superior via empirical comparisons, and designed via nicely intuited but ad hoc ideas. At the same time, the theory of BO has been largely confined to GP-based procedures. Among them, algorithmic consistency, i.e., convergence to the true optimum as the algorithm evolves indefinitely, can be attained with broader assumptions and algorithmic structures (but still GP-based; e.g., [40, 68, 21, 6]). Convergence rate results, while giving stronger conclusions, typically require more opaque assumptions, such as those associated with reproducing kernel Hilbert space (RKHS) and probability tail decays of the gradients of GPs, as well as more algorithmic restrictions such as upper confidence bound (UCB) instead of general acquisition functions [63, 32, 15]. In either case, there appears to be a mismatch between theory and practice: On one hand, practically superior algorithms do not have guarantees; on the other hand, BO theory is largely confined to GP and is not broad enough to cover more practical algorithms.

Given the above, our first high-level goal of this paper is to build a theory for exploration-based optimization that encompasses general algorithms beyond GP. This is a step toward reducing the gap between theory and practice in the realm of BO. To clarify our scope, by theory here we focus on algorithmic consistency. While this is weaker than rate results, algorithmic consistency is still largely open for procedures beyond GP, and moreover avoids the opaque assumptions and restrictive settings encountered in rate analysis. With this, our second high-level goal is to leverage our theory to locate strongly performing algorithms. Here, by strongly performing we mean algorithms that are empirically competitive against state-of-the-art benchmarks, while simultaneously having theoretical guarantees.

With these goals in mind, our main contribution is an axiomatic framework that elicits the minimal requirements on any exploration-based algorithm to guarantee convergence. Specifically, we dissect an exploration-based algorithm into three independent basic ingredients: surrogate predictor (SP), uncertainty quantifier (UQ), and acquisition function (AF). SP aims to provide point predictions at different design points. UQ quantifies the uncertainty of SP and indicates how reliable the current prediction is. AF transforms SP and UQ into the decision on which design point to evaluate next. We derive the axiomatic properties of SP, UQ and AF to attain theoretical convergence. In a nutshell, we call these properties local consistency, the sequential no-empty-ball property, and the improvement property respectively. These basic ingredients, along with their axiomatic properties, appear in GP-based algorithms. However, and as our key message, GP is not the only approach that exhibits these properties; instead, there are many more algorithms that could lead to similar convergence guarantees. We call our above framework Pseudo-Bayesian Optimization (PseudoBO). This is because, intuitively, we hinge on the Bayesian insight of BO in dissecting convergent procedures, but at the same time, a Bayesian perspective is generally not required for optimization convergence. Essentially, our framework extracts the minimally needed features in BO from the view of exploration-based optimization.

Our PseudoBO framework creates a recipe, with a list of SP, UQ and AF candidates that can be shown to satisfy the respective properties and hence altogether lead to algorithmic convergence. Our next endeavor is to assemble candidates in this recipe that perform competitively against existing benchmarks in empirical experiments. This resulting algorithm would then be one that exhibits both empirical and theoretical advantages over the benchmarks. In particular, we show how combining a simple local regression as SP, a suitably constructed “randomized prior” as UQ, and expected improvement (EI) as AF, can consistently outperform some state-of-the-art BO benchmarks across examples ranging from high-dimensional synthetic experiments to realistic hyperparameter tuning and robotic applications.

The remainder of this paper is as follows. Section 2 reviews related works on both practical and theoretical fronts. Section 3 presents the general theory of PseudoBO. Section 4 compiles the recipe that comprises a range of exemplifying ingredients under the PseudoBO framework. Section 5 discusses some details in turning our theory into implementation. Section 6 presents our empirical performances and comparisons with benchmark approaches. Section 7 concludes our paper and discusses future directions. All proofs and additional numerical details are presented in the Appendix.

2 Related Works

We review the existing literature. We categorize it roughly into two parts, one focusing on practical algorithms and implementations (Section 2.1) and one on theoretical guarantees (Section 2.2).

2.1 Literature on BO Practical Enhancements

We overview some existing approaches that aim to increase the scalability or efficiency of BO:

Sparse GP.  The first approach comprises sparse pseudo-input GP [55, 35, 59, 43, 28, 17]. The idea is to select $m$ inducing pseudo-inputs by minimizing information loss, with $m$ smaller than the number of collected data points $n$, to reduce the rank of the involved covariance matrix and subsequently the computational complexity. The second approach is GP with sparse spectrum [36]. By Bochner’s theorem [9], the kernel can be approximated by random features, so the GP can be approximated by Bayesian linear regression with a finite set of $m$ basis functions. In terms of complexity, both approaches reduce the time complexity of updating the posterior distribution from $\mathcal{O}(n^{3})$ to $\mathcal{O}(nm^{2}+m^{3})$. A hybrid method that speeds up posterior sampling by combining the two approximations is [72]. However, the involved approximation and hence information loss may cause imprecise surrogate prediction or uncertainty quantification [58].
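To make the sparse-spectrum idea concrete, the following is a minimal sketch of Bayesian linear regression on random Fourier features for an RBF kernel. It is an illustration only and is not taken from [36]: the function names, the standard normal prior on the weights, and the default hyperparameters are all assumptions. The comments mark where the $\mathcal{O}(nm^{2})$ and $\mathcal{O}(m^{3})$ costs arise.

```python
import numpy as np

def rff_features(X, W, b):
    """Random Fourier features whose inner products approximate an RBF kernel."""
    m = W.shape[1]
    return np.sqrt(2.0 / m) * np.cos(X @ W + b)

def fit_rff_blr(X, y, m=50, lengthscale=1.0, noise_var=1e-2, seed=0):
    """Bayesian linear regression on m random features (weights ~ N(0, I), assumed)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies drawn from the spectral density of the RBF kernel (Bochner's theorem).
    W = rng.normal(scale=1.0 / lengthscale, size=(d, m))
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)
    Phi = rff_features(X, W, b)                   # n x m design matrix
    A = Phi.T @ Phi + noise_var * np.eye(m)       # forming this costs O(n m^2)
    A_inv = np.linalg.inv(A)                      # inverting it costs O(m^3)
    mean_w = A_inv @ (Phi.T @ y)                  # posterior mean of the weights
    return W, b, mean_w, A_inv, noise_var

def predict_rff_blr(model, X_new):
    """Posterior mean and variance of the latent function at new inputs."""
    W, b, mean_w, A_inv, noise_var = model
    Phi = rff_features(X_new, W, b)
    mean = Phi @ mean_w
    # Posterior weight covariance is noise_var * A_inv; take the quadratic form row-wise.
    var = noise_var * np.einsum("ij,jk,ik->i", Phi, A_inv, Phi)
    return mean, var
```

Because only an $m \times m$ matrix is inverted, the cost of refitting grows linearly in $n$ for fixed $m$, at the price of the approximation error discussed above.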

Surrogate Substitution.  Rather than working on scaling GP, there has been a surge of research on replacing GP with other surrogate models. SMAC [29] leverages a random forest regressor and uses the standard deviation among trees to quantify uncertainty. However, the uncertainty quantification approach in this method, which utilizes bootstrapping, is unstable and can yield overly small uncertainty estimates in unexplored areas. An alternative is to use neural networks for surrogate modeling. DNGO [60] adopts Bayesian linear regression on top of the representations learnt by a neural network. Based on this, [51] proposes ABLR, which improves the two-step learning in [60] to joint Bayesian learning but with additional computational complexity. BOHAMIANN [62] uses a modified Hamiltonian Monte Carlo to improve robustness and scalability of the surrogate model. [71] uses a neural network ensemble. However, these methods require retraining the neural network after new data is collected, which is computationally expensive. Moreover, the performance appears sensitive to the network architecture [60].

Density Ratio Estimation-based Methods.  Another line of research focuses on directly modeling the acquisition function [7, 65, 61]. [7] proposes the tree-structured Parzen estimator (TPE) by establishing the equivalence between expected improvement (EI) and the relative ratio between two densities, specifically for handling discrete and tree-structured inputs. More recently, BORE [65] estimates this density ratio directly with likelihood-free inference, and LFBO [61] generalizes this framework to model any acquisition function in the form of expected utility.

Prior-data Fitted Networks (PFNs).  PFNs, initially proposed in [45], leverage in-context learning and Transformer architectures to approximate Bayesian inference. While PFNs significantly speed up posterior inference, they require substantial pretraining. This pretraining procedure involves repeatedly drawing data points from the prior distribution to effectively learn the probabilistic predictions needed for Bayesian inference. Applications of PFNs to BO include [44] and [52].

Other Approaches.  There are a variety of other works to speed up the computation of BO. One way is to leverage batch acquisition functions, which propose points in batches to be evaluated at once. Ensemble Bayesian Optimization [70] employs an ensemble of additive GPs as well as a batch acquisition function to scale BO to tens of thousands of observations. Other batch acquisition functions are proposed in [14, 69, 57, 73, 25]. Another line of work focuses on dimension reduction, such as BOCK [47] and HeSBO [46]. Recently, TuRBO [20] was proposed to combine BO with the trust region method and batch acquisition through Thompson sampling [64]. In terms of implementation, GPyTorch [23] scales GP computation to thousands of queries, using conjugate gradients to solve linear systems and the Lanczos process to approximate the log-determinant. By leveraging hardware development, BoTorch [4] provides a state-of-the-art implementation of BO, speeding up the computation of acquisition functions with Monte Carlo sampling, sample average approximation (SAA) and computational technologies such as auto-differentiation and parallel computation on CPUs and GPUs.

As explained in our introduction, the above approaches are practically oriented, in the sense that they are empirically attractive but can lack theoretical guarantees even on basic consistency. A contribution of our PseudoBO framework is our investigation of simple and cheap algorithms that can perform competitively against these benchmarks, while at the same time exhibiting a consistency guarantee. Furthermore, we note that PseudoBO is not meant to be separate from the above works, in the sense that these developed tools can be combined with PseudoBO to offer better performance or accelerate the computation even more. For example, our experiments (in Section 6) include the combination with trust regions to achieve higher scalability as in TuRBO, and the inner optimization of the AF via Sobol sequences.

2.2 Literature on BO Theory

We divide the study of BO algorithms with theoretical guarantees into the investigation of consistency and more elaborate convergence rate analysis.

Consistency.  [40] analyzes the consistency of BO with one-dimensional GP (in particular, Brownian motion) with EI. This was expanded upon in [68], which extends the consistency of EI to multi-dimensional settings, assuming the GP is stationary with the inverse of its spectral density exhibiting at most polynomial growth. Later, [6] generalizes the consistency to acquisitions of both EI and knowledge gradient (KG) for any Gaussian process with continuous sample paths. Besides, consistency results have also been established in various non-standard settings. For instance, [66] considers the objective as the sum or integral of multiple black-box functions. [2] considers objectives with inner network structures, where each node represents a black-box function with deterministic output. More generally, [1] studies composite objectives where a black-box function is composed with a function that has an explicitly known form. In the realm of simulation optimization, BO is applied to optimize expectation-form objective functions, and the simulator can generate aleatory uncertainties via possibly common random numbers, e.g., [50, 74]. Additionally, [67] considers black-box functions with input parameters that are unknown but observed via external data (i.e., the so-called input uncertainty in the stochastic simulation literature, e.g., [5, 16]). They propose Bayesian Information Collection and Optimization (BICO), which balances the trade-off between simulation and real data collection.

Convergence Rate Analysis.  This is cast commonly in the form of regret, i.e., the difference between the attained objective value and the oracle best objective value attained by the unknown ground-truth solution. [63] derives the first sub-linear cumulative regret in the noisy setting for UCB acquisition. [11] analyzes the regret for the EGO algorithm in the deterministic setting. Recently, [39] establishes a regret bound for UCB when the objective belongs to a parametric family of functions, and [37] gives a concentration bound especially when using $\epsilon$-greedy procedures in radial basis function interpolations. Finally, we mention that there is a large literature on multiarmed bandit problems, including for instance [33, 53] that develop Bayesian regret bounds for so-called Thompson sampling. However, the discrete or linear structures of these problems appear different from continuous-space black-box optimization. In particular, their uncertainty typically arises from data noise, while in our setting there can be zero noise from data and the uncertainty comes solely from the lack of knowledge about the unexplored portion of the objective function.

Except [39] and [37], all works above on analyzing BO focus on GP-based procedures. In contrast, PseudoBO derives an axiomatic dissection on consistent exploration-based algorithms beyond GP. In this way, we expand the scope of algorithms above that satisfy consistency. On the other hand, we do not offer rate results as in the second line of works above. However, we note that the theory on consistency already appears largely open beyond GP, and moreover avoids the opaque assumptions typically needed for sharper rate analyses.

3 Theory of Pseudo-Bayesian Optimization

Suppose we are interested in solving the optimization problem $\max_{x\in\mathcal{X}}f(x)$, where the objective function $f:\mathcal{X}\to\mathbb{R}$ is unknown and $\mathcal{X}\subset\mathbb{R}^{d}$ is the decision space. The observations are deterministic. To introduce PseudoBO, we start with a simple but general algorithmic framework as follows. We optimize $f$ by using sequential function evaluations, where selecting which point to evaluate next is guided by some evaluation worthiness (EW) measure, say $W_{n}(x;\mathcal{D}_{n})$. Here, $n$ is the step index in the procedure, and $\mathcal{D}_{n}=\{(x_{1},f(x_{1})),\ldots,(x_{n},f(x_{n}))\}$ denotes the collected data up to the $n$-th step. For convenience, we also denote $X_{n}=\{x_{1},\ldots,x_{n}\}$. In PseudoBO, at each step $n$ we solve $\max_{x\in\mathcal{X}}W_{n}(x;\mathcal{D}_{n})$ and the solution $x_{n+1}$ is set as the next point to evaluate. Thus, in summary, the PseudoBO algorithm is:

  For each iteration $n=1,2,\ldots,T$:

      Evaluate $f(x_{n})$;

      Update $W_{n}(x;\mathcal{D}_{n})$;

      Set $x_{n+1}\in\operatorname{argmax}_{x\in\mathcal{X}}W_{n}(x;\mathcal{D}_{n})$.

The estimated optimal solution at any step $n$ is $\hat{x}_{n}^{*}\in\operatorname{argmax}_{x\in X_{n}}f(x)$, which gives the maximum evaluated function value so far.
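The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the signature `ew(x, X, y, n)` standing for $W_{n}(x;\mathcal{D}_{n})$ is hypothetical, the inner maximization over $\mathcal{X}$ is replaced by a search over a finite candidate set, and `distance_ew` is a placeholder pure-exploration EW chosen only so that the example runs.

```python
import numpy as np

def pseudo_bo(f, ew, candidates, x1, T):
    """Generic loop: evaluate f(x_n), update the data, pick x_{n+1} by maximizing the EW.

    `ew(x, X, y, n)` is a hypothetical signature for W_n(x; D_n). The argmax over
    the continuous space is approximated by a search over `candidates`.
    """
    X = [np.asarray(x1, dtype=float)]
    y = []
    for n in range(1, T + 1):
        y.append(f(X[-1]))                                  # evaluate f(x_n)
        Xa, ya = np.array(X), np.array(y)                   # data D_n
        scores = np.array([ew(c, Xa, ya, n) for c in candidates])
        X.append(candidates[int(np.argmax(scores))])        # x_{n+1}
    X_eval, y_eval = np.array(X[:-1]), np.array(y)          # x_{T+1} is never evaluated
    best = int(np.argmax(y_eval))                           # incumbent among evaluated points
    return X_eval[best], y_eval[best], X_eval

def distance_ew(x, X, y, n):
    """Placeholder EW (an assumption): distance from x to the nearest evaluated point."""
    return float(np.min(np.linalg.norm(X - x, axis=1)))

# Usage on a toy one-dimensional problem with maximizer 0.3.
grid = np.linspace(0.0, 1.0, 101).reshape(-1, 1)
toy_f = lambda x: float(-(x[0] - 0.3) ** 2)
x_best, f_best, X_eval = pseudo_bo(toy_f, distance_ew, grid, grid[0], T=30)
```

Any other EW, for instance one assembled from a surrogate predictor and an uncertainty quantifier, can be passed in place of `distance_ew` without changing the loop.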

3.1 Basic Algorithmic Consistency

Our first result characterizes the algorithmic consistency of PseudoBO. We denote $Z^{*}=\max_{x\in\mathcal{X}}f(x)$ as the optimal value of the target problem, $\Delta(x,S)=\min_{y\in S}\|x-y\|$ as the set distance from $x\in\mathcal{X}$ to the set $S\subset\mathcal{X}$, and $E_{f}(S)=\{(x,f(x)):x\in S\}$ as the set of evaluated pairs $(x,f(x))$ for all $x\in S$.

Assumption 3.1 (Sequential no-empty-ball property).

$W_{n}(\cdot\ ;\ \cdot)$ satisfies the following:

  1. For any $x\in\mathcal{X}$ and finite-cardinality set sequence $S_{n}\subset\mathcal{X}$, if $\inf_{n}\Delta(x,S_{n})>0$, then $\liminf_{n\to\infty}W_{n}(x;A_{n})>0$ where $A_{n}=E_{f}(S_{n})$.

  2. For any convergent sequence $x_{n}\in\mathcal{X}$, i.e., $x_{n}\to x^{\prime}$ for some $x^{\prime}\in\mathcal{X}$, we have $W_{n}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})\to 0$, where $\mathcal{D}_{n-1}=\{(x_{1},f(x_{1})),\ldots,(x_{n-1},f(x_{n-1}))\}$ and $A_{n}=E_{f}(S_{n})$ for any finite-cardinality set sequence $S_{n}\subset\mathcal{X}$.

Roughly speaking, part 1 of Assumption 3.1 stipulates that as long as there is no infinitesimally close evaluated point in the neighborhood of $x$, the EW of $x$ is positive. Part 2 plays a converse role to state that if $x$ entails an approaching sequence, then the EW of $x$ would shrink to 0, and this is true with or without any additional data represented by the set sequence $\{A_{n}\}$. The finite cardinality condition on the considered set sequences is imposed since we focus on realistic algorithms that can evaluate only a finite number of points, even though conceptually the assumption can be relaxed to include any set sequences (but in this case the EW for specific examples needs to be properly defined). Note that Assumption 3.1 is purely about the function $W_{n}$ and its interaction with $f$, and does not assume anything about the optimization procedure. We call Assumption 3.1 the sequential no-empty-ball (SNEB) property, where the no-empty-ball (NEB) notion follows from [68] and indicates that a zero value of $W_{n}$ at a point $x$ means any ball surrounding $x$ must contain some evaluated points in the past data, and vice versa. It is “sequential” because part 2 of the assumption modifies the original NEB property to consider the value of $W_{n}$ at the sequence $x_{n}$ instead of its limit $x^{\prime}$ as in [68]. Moreover, we also allow $W_{n}$ to be step-dependent, i.e., to depend on the step $n$. The former modification facilitates the argument of our basic guarantee, while the latter is useful to apply to some important examples as we will see in the sequel.
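As a quick sanity check of Assumption 3.1, consider the pure-exploration choice $W_{n}(x;E_{f}(S))=\Delta(x,S)$, which ignores the function values altogether. This EW is introduced here only for illustration and is not claimed to be one of the candidates studied later. Both parts of the SNEB property can be verified directly, using for part 2 that $x_{n-1}\in X_{n-1}$ for $n\geq 2$ and that consecutive terms of a convergent sequence become arbitrarily close:

```latex
\begin{align*}
\text{Part 1: }\quad
  & \inf_{n}\Delta(x,S_{n}) = \delta > 0
  \;\Longrightarrow\;
  \liminf_{n\to\infty} W_{n}(x;A_{n})
  = \liminf_{n\to\infty}\Delta(x,S_{n}) \geq \delta > 0, \\
\text{Part 2: }\quad
  & x_{n}\to x^{\prime}
  \;\Longrightarrow\;
  0 \leq W_{n}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})
  = \Delta(x_{n},S_{n-1}\cup X_{n-1})
  \leq \|x_{n}-x_{n-1}\| \to 0 .
\end{align*}
```

The bound in part 2 holds regardless of the additional points $S_{n-1}$, since adding points can only decrease the set distance.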

We introduce our first theoretical result: As long as the EW $W$ satisfies the SNEB property, PseudoBO asymptotically reaches the true optimal value, or in other words it is algorithmically consistent.

Theorem 3.2 (Algorithmic consistency of PseudoBO).

Suppose EW $W_{n}$ satisfies Assumption 3.1 and $\mathcal{X}$ is compact. Then:

  • $\mathcal{X}$ is eventually populated by the PseudoBO iterates, i.e., for any $x\in\mathcal{X}$, we have $\inf_{n}\Delta(x,X_{n})=0$ for $X_{n}=\{x_{1},\ldots,x_{n}\}$ where $x_{n}$ is the $n$-th output iterate in the PseudoBO algorithm.

  • Consequently, if furthermore $f$ is continuous, then PseudoBO is algorithmically consistent, i.e., the estimated optimal solution $\hat{x}_{n}^{*}\in\operatorname{argmax}_{x\in X_{n}}f(x)$ satisfies $f(\hat{x}_{n}^{*})\to Z^{*}$ as $n\to\infty$.

Theorem 3.2 reveals the exploration-based nature of PseudoBO: It achieves convergence by populating the search space. More precisely, the first part of the theorem concludes eventual population of the space, which means that any $x\in\mathcal{X}$ eventually has arbitrarily close evaluated points among the PseudoBO iterates. By selecting the historically best evaluated point, this population of the space then turns into asymptotic convergence to the optimal value, which is the second part of the theorem. Here, while EW can incorporate many sources of information, a requirement is that it must contain information about how densely each local region has been evaluated, so as to guide us in populating the space. In reality, we would like to make guesses and evaluate at points that are likely close to the best (exploitation), but also be cautious about missing out other potentially good regions (exploration). Accounting for this tradeoff requires a more specialized framework that contains ingredients to handle this issue more explicitly. We will describe these ingredients in the next subsection.
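The space-filling behavior behind Theorem 3.2 can be observed numerically. The sketch below is a toy check under assumptions, not a reproduction of any experiment: it uses a one-dimensional grid in place of $\mathcal{X}$, a purely distance-based EW, and a hypothetical helper `run_exploration`. It tracks the fill distance $\max_{x}\Delta(x,X_{n})$ over the grid together with the incumbent value $f(\hat{x}_{n}^{*})$.

```python
import numpy as np

def run_exploration(f, grid, x1, T):
    """Run a distance-based EW on a 1-D grid, recording fill distance and incumbent value."""
    X, y = [float(x1)], []
    fills, bests = [], []
    for _ in range(T):
        y.append(f(X[-1]))                                   # evaluate the newest point
        # Distance from every grid point to its nearest evaluated point.
        dist = np.abs(grid[:, None] - np.array(X)[None, :]).min(axis=1)
        fills.append(float(dist.max()))                      # fill distance of X_n on the grid
        bests.append(max(y))                                 # value at the incumbent
        X.append(float(grid[int(np.argmax(dist))]))          # next point: farthest from the data
    return fills, bests

grid = np.linspace(0.0, 1.0, 101)
toy_f = lambda x: -(x - 0.7) ** 2                            # toy objective, optimal value 0
fills, bests = run_exploration(toy_f, grid, x1=0.0, T=20)
```

In this toy run the fill distance never increases and the incumbent value approaches the optimal value 0, which mirrors the two parts of the theorem. The run also shows why pure exploration alone is wasteful: most evaluations land far from the maximizer.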

3.2 A More Specialized Framework

We consider a more specialized version of PseudoBO that materializes EW via three ingredients: surrogate predictor (SP), uncertainty quantifier (UQ) and acquisition function (AF). As discussed earlier, these ingredients appear in GP-based algorithms in BO but could be designed substantially more generally. Each of these ingredients needs to satisfy its own basic, independent, assumption, which we call local consistency, SNEB (introduced before) and the improvement property respectively.

Assumption 3.3 (Local consistency of SP).

The SP $\hat{f}(\cdot;\cdot):\mathcal{X}\times E_{f}(\mathcal{X})\to\mathbb{R}$ satisfies that for any convergent sequence $\{x_{n}\}\subset\mathcal{X}$, i.e., $x_{n}\to x^{\prime}$ for some $x^{\prime}\in\mathcal{X}$, we have $\hat{f}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})\to f(x^{\prime})$, where $\mathcal{D}_{n-1}=\{(x_{1},f(x_{1})),\ldots,(x_{n-1},f(x_{n-1}))\}$ and $A_{n}=E_{f}(S_{n})$ for any set sequence $S_{n}\subset\mathcal{X}$.

In Assumption 3.3, $\hat{f}(x;\mathcal{D})$ represents the predictor at $x$ using data $\mathcal{D}$. This assumption stipulates that the true function value at a target point can be approximated with increasing precision by $\hat{f}$ constructed at evaluation points converging to this target, with the historically evaluated points and any additional data.

Assumption 3.4 (SNEB property of UQ).

The UQ $\hat{\sigma}(\cdot;\cdot):\mathcal{X}\times E_{f}(\mathcal{X})\to\mathbb{R}$ satisfies:

  1. For any $x\in\mathcal{X}$ and finite-cardinality set sequence $S_{n}\subset\mathcal{X}$, if $\inf_{n}\Delta(x,S_{n})>0$, then $\liminf_{n\to\infty}\hat{\sigma}(x;A_{n})>0$ where $A_{n}=E_{f}(S_{n})$.

  2. For any convergent sequence $x_{n}\in\mathcal{X}$, i.e., $x_{n}\to x^{\prime}$ for some $x^{\prime}\in\mathcal{X}$, we have $\hat{\sigma}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})\to 0$, where $\mathcal{D}_{n-1}=\{(x_{1},f(x_{1})),\ldots,(x_{n-1},f(x_{n-1}))\}$ and $A_{n}=E_{f}(S_{n})$ for any finite-cardinality set sequence $S_{n}\subset\mathcal{X}$.

Note that the SNEB property in Assumption 3.4 on the UQ is exactly the same as Assumption 3.1, with $\hat{\sigma}$ in place of $W_{n}$. This highlights the key role of UQ as the driver of exploration and ultimately solution convergence via the EW framework. However, by incorporating the SP via the AF (discussed momentarily), we can induce exploitation to enhance practical performance.

Assumption 3.5 (Improvement property of AF).

The AF $g_{n}(\cdot,\cdot):\mathbb{R}\times\mathbb{R}_{+}\to\mathbb{R}$ satisfies the following (where $p_{n}$ and $q_{n}$ are any real sequences):

  1. $\liminf_{n\to\infty}g_{n}(p_{n},q_{n})>0$ if $\liminf_{n\to\infty}p_{n}>-\infty$ and $\liminf_{n\to\infty}q_{n}>0$.

  2. $g_{n}(p_{n},q_{n})\to 0$ if $\limsup_{n\to\infty}p_{n}\leq 0$ and $q_{n}\to 0$.

AF can be viewed as a channel to convert SP and UQ into EW. That is, the higher the output of $g_{n}$, the more worthwhile it is to evaluate the point under consideration. In Assumption 3.5, $p_{n}$ is the argument for the potential improvement regarding point estimation, and $q_{n}$ is the argument for the uncertainty. Part 2 of the assumption states that if there is, with eventual certainty, no improvement, then the worthiness to evaluate becomes zero. In contrast, part 1 stipulates that, as long as there is uncertainty, there is some worthiness to evaluate the considered point (note that the condition $\liminf_{n\to\infty}p_{n}>-\infty$ there is largely a technicality that avoids $p_{n}$ being unboundedly negative). Finally, we allow the AF to be step-dependent, which is intended to make our PseudoBO framework general enough to cover common existing algorithms.
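As a numerical illustration of the two parts of Assumption 3.5, the sketch below evaluates a step-independent, expected-improvement-style function $g(p,q)=p\,\Phi(p/q)+q\,\phi(p/q)$ along two sequences. The choice of this particular form, the name `ei_af`, and the specific sequences are assumptions made here for illustration only; the check is numerical and does not constitute a proof.

```python
import math


def normal_pdf(z: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def ei_af(p: float, q: float) -> float:
    """Illustrative AF: g(p, q) = p * Phi(p / q) + q * phi(p / q) for q > 0."""
    if q <= 0.0:
        return max(p, 0.0)  # limiting value as q decreases to zero
    z = p / q
    return p * normal_cdf(z) + q * normal_pdf(z)


# Part 1: p_n is bounded below and q_n stays bounded away from zero,
# so the AF values should stay bounded away from zero.
part1 = [ei_af(-1.0, 0.5 + 1.0 / n) for n in range(1, 2001)]

# Part 2: limsup p_n <= 0 and q_n -> 0, so the AF values should vanish.
part2 = [ei_af(-1.0 / n, 1.0 / n) for n in range(1, 2001)]

print("smallest value along the part-1 sequence:", min(part1))
print("first and last values along the part-2 sequence:", part2[0], part2[-1])
```

Along the first sequence the values remain strictly positive, while along the second they shrink toward zero, matching the qualitative behavior the assumption asks for.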

We are now ready to put together all the above ingredients into algorithmic consistency. First, for a set $\mathcal{D}\subset E_{f}(\mathcal{X})$, denote $\Pi_{f}(\mathcal{D})=\{y:(x,y)\in\mathcal{D}\text{ for some }x\in\mathcal{X}\}$ as the projection of $\mathcal{D}$ onto the output dimension.

Theorem 3.6 (From SP+UQ+AF to EW).

Suppose SP $\hat{f}$, UQ $\hat{\sigma}$, and AF $g_{n}$ satisfy Assumptions 3.3, 3.4 and 3.5 respectively. Suppose also that $f$ is continuous. Then the EW constructed by $W_{n}(x;\mathcal{D}_{n})=g_{n}(\zeta(\hat{f}(x;\mathcal{D}_{n})-\max\Pi_{f}(\mathcal{D}_{n})),\hat{\sigma}(x;\mathcal{D}_{n}))$, where $\zeta(\cdot)$ is continuous and non-decreasing and $\zeta(0)\leq 0$, satisfies Assumption 3.1.

Based on Theorems 3.2 and 3.6, we obtain the following guarantee:

Corollary 3.7 (Algorithmic consistency via SP+UQ+AF).

Under the same assumptions as Theorem 3.6, for any compact $\mathcal{X}$, PseudoBO with EW constructed by $W_{n}(x;\mathcal{D}_{n})=g_{n}(\zeta(\hat{f}(x;\mathcal{D}_{n})-\max\Pi_{f}(\mathcal{D}_{n})),\hat{\sigma}(x;\mathcal{D}_{n}))$ is algorithmically consistent.
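To make the construction $W_{n}=g_{n}(\zeta(\hat{f}-\max\Pi_{f}(\mathcal{D}_{n})),\hat{\sigma})$ concrete, the following minimal one-dimensional sketch composes an EW from three simple ingredients. All three are illustrative assumptions rather than prescribed choices: a nearest-neighbor SP (`nn_predictor`), a UQ equal to the distance to the closest evaluated point (`distance_uq`), a step-independent expected-improvement-style AF (`ei_af`), and the identity for $\zeta$.

```python
import math
from typing import Callable, List, Tuple

Data = List[Tuple[float, float]]  # evaluated pairs (x_i, f(x_i)), 1-D inputs


def nn_predictor(x: float, data: Data) -> float:
    """Illustrative SP: value observed at the nearest evaluated point."""
    return min(data, key=lambda pair: abs(pair[0] - x))[1]


def distance_uq(x: float, data: Data) -> float:
    """Illustrative UQ: distance from x to the closest evaluated point."""
    return min(abs(xi - x) for xi, _ in data)


def ei_af(p: float, q: float) -> float:
    """Illustrative AF: p * Phi(p / q) + q * phi(p / q) for q > 0."""
    if q <= 0.0:
        return max(p, 0.0)
    z = p / q
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return p * cdf + q * pdf


def evaluation_worthiness(
    x: float,
    data: Data,
    sp: Callable[[float, Data], float] = nn_predictor,
    uq: Callable[[float, Data], float] = distance_uq,
    af: Callable[[float, float], float] = ei_af,
    zeta: Callable[[float], float] = lambda t: t,  # continuous, non-decreasing, zeta(0) <= 0
) -> float:
    """W(x; D) = af(zeta(sp(x; D) - max Pi_f(D)), uq(x; D))."""
    best_so_far = max(y for _, y in data)  # max of the projection Pi_f(D)
    return af(zeta(sp(x, data) - best_so_far), uq(x, data))


data = [(0.0, 0.0), (1.0, 1.0)]
candidates = [i / 10 for i in range(11)]
scores = {x: evaluation_worthiness(x, data) for x in candidates}
next_x = max(candidates, key=scores.get)
print("next point to evaluate:", next_x)
```

In this toy run the EW is zero at the already evaluated points and positive elsewhere, and the maximizer lies near the incumbent best point, which reflects the exploitation induced by the SP through the AF.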

3.3 $(\delta,\epsilon)$-Relaxation of PseudoBO

The above PseudoBO framework guarantees eventual popularization and subsequently algorithmic consistency. In this subsection, we relax the popularization requirement to only a certain resolution, i.e., any point in $\mathcal{X}$ eventually has evaluated points within, say, a $\delta$-sized neighborhood. This in turn leads to algorithmic consistency up to an error relating to $\delta$. We study this relaxation for two reasons. First, in many problems it makes sense to aim for near-optimality instead of exact optimality, because a small optimality gap can be practically negligible and not cost-effective to close. Second, and more importantly, we will see that some SPs are in fact fundamentally accurate only up to a certain level of error. For such SPs, their paired UQs are naturally SNEB up to an associated relaxation level, which in turn induces the $\delta$-relaxed eventual popularization.

We first state a relaxation of the SNEB property.

Assumption 3.8 ($\delta$-relaxed SNEB property).

For a given $\delta\geq 0$, $W_{n}(\cdot\,;\,\cdot)$ satisfies the following:

  1. For any $x\in\mathcal{X}$ and finite-cardinality set sequence $S_{n}\subset\mathcal{X}$, if $\inf_{n}\Delta(x,S_{n})>\delta$, then $\liminf_{n\to\infty}W_{n}(x;A_{n})>0$ where $A_{n}=E_{f}(S_{n})$.

  2. For any convergent sequence $x_{n}\in\mathcal{X}$, i.e., $x_{n}\to x^{\prime}$ for some $x^{\prime}\in\mathcal{X}$, we have $W_{n}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})\to 0$, where $\mathcal{D}_{n-1}=\{(x_{1},f(x_{1})),\ldots,(x_{n-1},f(x_{n-1}))\}$ and $A_{n}=E_{f}(S_{n})$ for any finite-cardinality set sequence $S_{n}\subset\mathcal{X}$.

Compared to Assumption 3.1, Assumption 3.8 only requires that $\liminf_{n\to\infty}W_{n}(x;A_{n})>0$ if $\inf_{n}\Delta(x,S_{n})>\delta$ instead of $\inf_{n}\Delta(x,S_{n})>0$. That is, the point has positive EW if it is sufficiently far, i.e., more than $\delta$ units, away from all eventually evaluated points. All other parts of the assumption remain the same as before. Assumption 3.8 is weaker than Assumption 3.1 when $\delta>0$, and reduces back to Assumption 3.1 when $\delta=0$.

With the above updated assumption, we obtain a corresponding relaxed version of algorithmic consistency. For this, we need to strengthen the continuity assumption on the objective function $f$ to Lipschitz continuity. We call $f$ $L$-Lipschitz if $|f(x)-f(x^{\prime})|\leq L\|x-x^{\prime}\|$ for any $x,x^{\prime}\in\mathcal{X}$ and some given constant $L>0$.

Theorem 3.9 ($\delta$-relaxed algorithmic consistency of PseudoBO).

Suppose EW $W_{n}$ satisfies Assumption 3.8 and $\mathcal{X}$ is compact. Then:

  • $\mathcal{X}$ is eventually populated by the PseudoBO iterates up to $\delta$-neighborhoods, i.e., for any $x\in\mathcal{X}$, we have $\inf_{n}\Delta(x,X_{n})\leq\delta$ for $X_{n}=\{x_{1},\ldots,x_{n}\}$ where $x_{n}$ is the $n$-th output iterate in the PseudoBO algorithm.

  • Consequently, if furthermore $f$ is $L$-Lipschitz continuous, then PseudoBO is algorithmically consistent up to $L\delta$ error, i.e., the estimated optimal solution $\hat{x}_{n}^{*}\in\text{argmax}_{x\in X_{n}}f(x)$ satisfies $\liminf_{n}f(\hat{x}_{n}^{*})\geq Z^{*}-L\delta$.

Like Theorem 3.2, part 1 of Theorem 3.9 concludes that the decision space $\mathcal{X}$ is eventually populated, but now up to $\delta$-sized neighborhoods. Part 2 then translates this popularization to algorithmic consistency, now with an error that depends on $\delta$ and the Lipschitz constant $L$.
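The link between $\delta$-level coverage and the $L\delta$ optimality gap can be checked on a toy instance. The sketch below is an illustration under stated assumptions, not part of the theorem: it takes $\mathcal{X}=[0,1]$, reads $\Delta(x,S)$ as the distance from $x$ to the nearest point of $S$, fixes a hypothetical set of evaluated points that covers $[0,1]$ within $\delta$, and places the maximizer of an $L$-Lipschitz function as far as possible from those points.

```python
L = 2.0        # Lipschitz constant of the toy objective
delta = 0.1    # coverage resolution
x_star = 0.4   # maximizer, placed midway between two evaluated points


def f(x: float) -> float:
    """Toy L-Lipschitz objective with maximum value Z* = 0 at x_star."""
    return -L * abs(x - x_star)


def dist_to_set(x: float, points: list) -> float:
    """Distance from x to the nearest point in the set (assumed meaning of Delta)."""
    return min(abs(x - s) for s in points)


# Hypothetical evaluated points 0.1, 0.3, ..., 0.9: every x in [0, 1] is within delta of one.
evaluated = [delta * (2 * k + 1) for k in range(5)]

# Check the delta-coverage on a fine grid of [0, 1].
grid = [i / 1000 for i in range(1001)]
worst_gap = max(dist_to_set(x, evaluated) for x in grid)

# Best observed value versus the true optimum.
best_value = max(f(x) for x in evaluated)
z_star = f(x_star)

print("largest distance to an evaluated point:", worst_gap)
print("best observed value:", best_value, "lower bound Z* - L*delta:", z_star - L * delta)
```

Here the maximizer sits at the worst possible location relative to the evaluated points, so the best observed value essentially meets the bound $Z^{*}-L\delta$, which suggests the error level in part 2 cannot be improved without further assumptions in this toy setting.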

Next, like Section 3.2, we specialize our relaxed framework to the use of SP, UQ and AF. We first consider a relaxed version of local consistency of SP.

Assumption 3.10 ($\epsilon$-relaxed local consistency of SP).

For a given $\epsilon\geq 0$, the SP $\hat{f}(\cdot;\cdot):\mathcal{X}\times E_{f}(\mathcal{X})\to\mathbb{R}$ satisfies that for any convergent sequence $x_{n}\in\mathcal{X}$, i.e., $x_{n}\to x^{\prime}$ for some $x^{\prime}\in\mathcal{X}$, we have

$$\limsup_{n}|\hat{f}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})-f(x^{\prime})|\leq\epsilon$$

where $\mathcal{D}_{n-1}=\{(x_{1},f(x_{1})),\ldots,(x_{n-1},f(x_{n-1}))\}$ and $A_{n}=E_{f}(S_{n})$ for any finite-cardinality set sequence $S_{n}\subset\mathcal{X}$.

Compared to Assumption 3.3, Assumption 3.10 stipulates that $\hat{f}$ is accurate only up to an $\epsilon$ error even when there is a sequence of evaluated points that gets infinitesimally close. That is, the predictor is fundamentally erroneous at level $\epsilon$. Note that, when $\epsilon=0$, Assumption 3.10 reduces back to Assumption 3.3.

Next, the following is the relaxed version of SNEB for the UQ $\hat{\sigma}$, which is the same as that for EW in Assumption 3.8.

Assumption 3.11 ($\delta$-relaxed SNEB property of UQ).

For a given $\delta\geq 0$, the UQ $\hat{\sigma}(\cdot;\cdot):\mathcal{X}\times E_{f}(\mathcal{X})\to\mathbb{R}$ satisfies:

  1. For any $x\in\mathcal{X}$ and finite-cardinality set sequence $S_{n}\subset\mathcal{X}$, if $\inf_{n}\Delta(x,S_{n})>\delta$, then $\liminf_{n\to\infty}\hat{\sigma}(x;A_{n})>0$ where $A_{n}=E_{f}(S_{n})$.

  2. For any convergent sequence $x_{n}\in\mathcal{X}$, i.e., $x_{n}\to x^{\prime}$ for some $x^{\prime}\in\mathcal{X}$, we have $\hat{\sigma}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})\to 0$, where $\mathcal{D}_{n-1}=\{(x_{1},f(x_{1})),\ldots,(x_{n-1},f(x_{n-1}))\}$ and $A_{n}=E_{f}(S_{n})$ for any finite-cardinality set sequence $S_{n}\subset\mathcal{X}$.

Finally, the improvement property of AF remains the same under the relaxed framework as before, i.e., we would still use Assumption 3.5 for AF. We then have the following guarantee.

Theorem 3.12 (From SP+UQ+AF to EW under $(\epsilon,\delta)$-relaxation).

Suppose SP $\hat{f}$, UQ $\hat{\sigma}$, and AF $g_{n}$ satisfy Assumptions 3.10, 3.11 and 3.5 respectively. Suppose also that $f$ is continuous. Then the EW constructed by $W_{n}(x;\mathcal{D}_{n})=g_{n}(\zeta(\hat{f}(x;\mathcal{D}_{n})-\max\Pi_{f}(\mathcal{D}_{n})-\epsilon),\hat{\sigma}(x;\mathcal{D}_{n}))$, where $\zeta(\cdot)$ is continuous and non-decreasing and $\zeta(0)\leq 0$, satisfies Assumption 3.8.

Compared to Theorem 3.6, in Theorem 3.12 we use the relaxed versions of the local consistency and SNEB properties. Moreover, we use $\zeta(\hat{f}(x;\mathcal{D}_{n})-\max\Pi_{f}(\mathcal{D}_{n})-\epsilon)$ instead of $\zeta(\hat{f}(x;\mathcal{D}_{n})-\max\Pi_{f}(\mathcal{D}_{n}))$ in the first argument of $g_{n}(\cdot,\cdot)$ when defining the EW. Recall that the first argument of $g_{n}(\cdot,\cdot)$ represents the potential improvement regarding point estimation. When the SP has a fundamental inaccuracy of $\epsilon$ units, we naturally consider potential improvement only up to the same amount of prediction inaccuracy, thus leading to the extra $\epsilon$-reduction in this improvement calculation.

Lastly, based on Theorems 3.9 and 3.12, we obtain the following guarantee:

Corollary 3.13 (Algorithmic consistency via SP+UQ+AF under $(\epsilon,\delta)$-relaxation).

Under the same assumptions as Theorem 3.12, for any compact $\mathcal{X}$ and assuming additionally that $f$ is $L$-Lipschitz, PseudoBO with EW constructed by $W_{n}(x;\mathcal{D}_{n})=g_{n}(\zeta(\hat{f}(x;\mathcal{D}_{n})-\max\Pi_{f}(\mathcal{D}_{n})-\epsilon),\hat{\sigma}(x;\mathcal{D}_{n}))$ is algorithmically consistent up to $L\delta$ error.

Corollary 3.13 concludes that, when we define the potential improvement in a way that addresses the $\epsilon$ prediction inaccuracy, and the UQ is obtained with a $\delta$-relaxed SNEB property, the resulting PseudoBO procedure has algorithmic consistency up to an $L\delta$ error. Note that when $\epsilon=\delta=0$, Corollary 3.13 reduces back to the non-relaxed case in Corollary 3.7. In fact, the requirements on $\epsilon$ and $\delta$ are separate, i.e., the corollary holds even if only one of $\epsilon$ and $\delta$ is non-zero. That is to say, we can opt to aim for popularization up to $\delta$-neighborhoods, and consequently $L\delta$-optimality, when the SP has no fundamental error. Similarly, we might have an $\epsilon$-inaccurate SP but choose a UQ that has a precise SNEB property instead of being $\delta$-relaxed. This is indeed possible; however, as we will see in the next section, some natural choices of UQs that couple with an $\epsilon$-relaxed locally consistent SP would only bear the $\delta$-relaxed SNEB property. This latter behavior makes the relaxation of both the local consistency and the SNEB property important and naturally coupled.

We close this section by explaining why it is important to consider SPs that are $\epsilon$-relaxed locally consistent. This arises from the fact that many common machine learning predictors contain hyperparameters that typically need to be tuned in relation to the sample size. Such tuning would ensure the predictor is locally consistent, but only if the design points are sampled according to certain distributions. As we define our local consistency (Assumptions 3.3 and 3.10) in a way that is free of any distributional assumption on the $\{x_{n}\}$ sequence, standard tuning approaches that aim for consistency would not apply. Our $\epsilon$-relaxation serves to remedy this issue since, to achieve the relaxed local consistency, we can simply choose a fixed hyperparameter value instead of scaling it with the number of evaluated points.
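One way to see how a fixed hyperparameter yields a bounded, non-vanishing error is a local average with a fixed window. The sketch below is a hypothetical illustration, not a predictor taken from the text: it uses a box-kernel local average with fixed bandwidth `h` in one dimension. Because every averaged observation lies within `h` of the query point, an $L$-Lipschitz objective gives a prediction error of at most $Lh$ regardless of how the design points are placed, so $\epsilon=Lh$ would be a natural relaxation level in this toy setting.

```python
import math
from typing import List, Optional, Tuple

Data = List[Tuple[float, float]]  # evaluated pairs (x_i, f(x_i)), 1-D inputs

L = 3.0   # Lipschitz constant of the toy objective below
h = 0.1   # fixed bandwidth, not scaled with the number of points


def f(x: float) -> float:
    """Toy objective; its derivative is bounded by L = 3 in absolute value."""
    return math.sin(L * x)


def box_kernel_predictor(x: float, data: Data, bandwidth: float) -> Optional[float]:
    """Average of the observed values within `bandwidth` of x (None if there are none)."""
    near = [y for xi, y in data if abs(xi - x) <= bandwidth]
    return sum(near) / len(near) if near else None


# Deliberately irregular design points, with no distributional structure assumed.
design = [0.0, 0.01, 0.02, 0.5, 0.52, 0.9]
data = [(x, f(x)) for x in design]

queries = [0.05, 0.45, 0.6, 0.95]
errors = [abs(box_kernel_predictor(q, data, h) - f(q)) for q in queries]

print("prediction errors:", errors)
print("bound L * h:", L * h)
```

The errors stay below $Lh$ at every query but do not shrink to zero as long as `h` is fixed, which is the kind of behavior the $\epsilon$-relaxation is meant to accommodate.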

4 The PseudoBO Cookbook

We present a range of examples for SP, UQ and AF to demonstrate the generality of PseudoBO and how it applies to existing algorithms as well as new ones. Importantly, it also paves the way for us to select practically superior algorithms. Our results can be summarized as a PseudoBO recipe in Figure 1.

Fig. 1: A general recipe for configuring a PseudoBO algorithm.

4.1 SP with Local Consistency

The essence of local consistency is that the SP can correctly estimate the objective value given sufficient data nearby and continuity of the objective. A representative type of locally consistent SP is the class of interpolating models, which reproduce the evaluated values at all evaluated points and otherwise predict values close to those of the surrounding evaluated points. GP posterior mean, nearest neighbor and over-parameterized neural networks are examples. For our discussion below, we recall the notation $X_{n}=\{x_{1},\ldots,x_{n}\}\subset\mathcal{X}$ as a finite set of points that we select to evaluate, and $\mathcal{D}_{n}=\{(x_{1},f(x_{1})),\ldots,(x_{n},f(x_{n}))\}$ as the collection of evaluated pairs.
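The interpolation property and its link to local consistency can be observed numerically with the simplest of these examples. The sketch below is a one-dimensional illustration under assumed choices (the objective `f`, the limit point `x_limit`, and the sequence $x_{n}=x^{\prime}+1/n$ are all hypothetical): a nearest-neighbor predictor built on $\mathcal{D}_{n-1}$ is queried at $x_{n}$, and its error against $f(x^{\prime})$ is recorded.

```python
import math
from typing import List, Tuple

Data = List[Tuple[float, float]]  # evaluated pairs (x_i, f(x_i)), 1-D inputs


def f(x: float) -> float:
    """Toy continuous objective."""
    return math.sin(3.0 * x)


def nn_predictor(x: float, data: Data) -> float:
    """Nearest-neighbor SP: return the value observed at the closest evaluated point."""
    return min(data, key=lambda pair: abs(pair[0] - x))[1]


x_limit = 0.5
xs = [x_limit + 1.0 / n for n in range(1, 201)]  # x_n -> x_limit

errors = []
for n in range(2, len(xs) + 1):
    history = [(x, f(x)) for x in xs[: n - 1]]   # D_{n-1}
    x_n = xs[n - 1]
    errors.append(abs(nn_predictor(x_n, history) - f(x_limit)))

full_data = [(x, f(x)) for x in xs]
print("error at the first and last step:", errors[0], errors[-1])
print("interpolates at an evaluated point:", nn_predictor(xs[3], full_data) == f(xs[3]))
```

The error decreases as the evaluated points accumulate around the limit point, and the predictor returns the observed value exactly at any evaluated point.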

We start with the GP posterior mean predictor. To define this, we first choose a positive semi-definite kernel function, $\mathcal{K}:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$, and a mean function, $\mu_0:\mathbb{R}^d\to\mathbb{R}$. The mean function is commonly a constant function $\mu_0(x)=0$ for centered GPs or a low-order polynomial [22]. Given data $\mathcal{D}_n$, the GP posterior mean predictor is defined as

$$\hat{f}_{\text{GP}}(x_0;\mathcal{D}_n)=\mathcal{K}(x_0,X_n)\mathcal{K}(X_n,X_n)^{-1}(f(X_n)-\mu_0(X_n))+\mu_0(x_0), \tag{1}$$

where $\mathcal{K}(x_0,X_n):=[\mathcal{K}(x_0,x_1),\ldots,\mathcal{K}(x_0,x_n)]$, $\mathcal{K}(X_n,X_n):=[\mathcal{K}(x_i,x_j)]_{1\leq i,j\leq n}$, $\mu_0(X_n):=[\mu_0(x_1),\ldots,\mu_0(x_n)]$ and $f(X_n):=[f(x_1),\ldots,f(x_n)]$, for a test point $x_0\in\mathcal{X}$. The kernel function $\mathcal{K}$ and mean function $\mu_0$ correspond to the prior covariance and mean of the GP, and with data $\mathcal{D}_n$ we output the posterior mean of the GP as the SP (1). Note that the GP posterior mean predictor is an interpolator. Specifically, for any observed data point $(x_i,f(x_i))\in\mathcal{D}_n$, we have $\hat{f}_{\text{GP}}(x_i;\mathcal{D}_n)=f(x_i)$. The intuition behind this is that $\mathcal{K}(x_i,X_n)$ is the $i$-th row of $\mathcal{K}(X_n,X_n)$, so that $\mathcal{K}(x_i,X_n)\mathcal{K}(X_n,X_n)^{-1}=e_i^{\intercal}$, where $e_i$ is the $i$-th standard basis vector. This leads to $\hat{f}_{\text{GP}}(x_i;\mathcal{D}_n)=e_i^{\intercal}(f(X_n)-\mu_0(X_n))+\mu_0(x_i)=f(x_i)$.
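The interpolation property of (1) can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes scalar inputs, a Gaussian kernel with an illustrative bandwidth, and a small hand-written linear solver so that only the Python standard library is needed. The names `gaussian_kernel`, `solve` and `gp_posterior_mean` are hypothetical.

```python
import math

def gaussian_kernel(x, y, h=1.0):
    """Gaussian kernel on scalars; the bandwidth h = 1.0 is an illustrative choice."""
    return math.exp(-((x - y) ** 2) / (2.0 * h ** 2))

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - tail) / M[i][i]
    return z

def gp_posterior_mean(x0, X, y, kernel=gaussian_kernel, mu0=lambda x: 0.0):
    """Equation (1): K(x0, X) K(X, X)^{-1} (f(X) - mu0(X)) + mu0(x0)."""
    K = [[kernel(a, b) for b in X] for a in X]
    residual = [yi - mu0(xi) for xi, yi in zip(X, y)]
    coef = solve(K, residual)  # K(X, X)^{-1} (f(X) - mu0(X))
    k0 = [kernel(x0, xi) for xi in X]
    return sum(k * c for k, c in zip(k0, coef)) + mu0(x0)

# Hypothetical evaluated points and objective values.
X = [0.0, 0.7, 1.5, 2.4]
y = [math.sin(3.0 * x) for x in X]

# At every evaluated point the predictor should reproduce the evaluated value.
fitted = [gp_posterior_mean(xi, X, y) for xi in X]
max_gap = max(abs(a - b) for a, b in zip(fitted, y))
print("largest interpolation gap:", max_gap)
```

The gap is zero only up to floating-point error, and the check relies on $\mathcal{K}(X_n,X_n)$ being invertible, which holds here because the chosen points are distinct and the Gaussian kernel is positive definite.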

Proposition 4.1 (Local consistency of GP mean predictor).

Assume a GP’s covariance function $\mathcal{K}$ is continuous and positive definite. Also assume $f\in\mathcal{H}$, the RKHS induced by $\mathcal{K}$. Then the GP posterior mean is locally consistent.

Note that we do not make any probabilistic assumption regarding the black-box function in Proposition 4.1. This is the essence of our PseudoBO framework, that we only extract properties needed to guarantee algorithmic convergence instead of using probabilistic assumptions from the Bayesian framework.

Next, we consider the nearest neighbor predictor

$$\hat{f}_{\text{NN}}(x_0;\mathcal{D}_n)=f\Big(\operatorname*{arg\,min}_{x\in X_n}\left\lVert x-x_0\right\rVert\Big) \tag{2}$$

for a test point $x_0\in\mathcal{X}$.

Proposition 4.2 (Local consistency of nearest neighbor).

Assume $f$ is continuous. Then the nearest neighbor predictor is locally consistent.
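As an illustration of this statement, the sketch below implements (2) and evaluates it on increasingly fine designs around a fixed test point. The objective, the test point and the grid designs are assumptions made for the example only; in PseudoBO the evaluated points would come from the algorithm itself.

```python
import math

def nearest_neighbor_predict(x0, X, y):
    """Equation (2): return the evaluated value at the point of X closest to x0."""
    i = min(range(len(X)), key=lambda j: math.dist(X[j], x0))
    return y[i]

# Hypothetical continuous objective on [0, 1]^2 and a fixed test point.
f = lambda x: math.sin(x[0]) + x[1] ** 2
x0 = (0.3, 0.6)

errors = []
for m in (2, 8, 32):
    # Illustrative design: a regular (m + 1) x (m + 1) grid on [0, 1]^2.
    X = [(i / m, j / m) for i in range(m + 1) for j in range(m + 1)]
    y = [f(x) for x in X]
    errors.append(abs(nearest_neighbor_predict(x0, X, y) - f(x0)))

print(errors)  # the error shrinks as evaluated points accumulate near x0
```

Since the prediction is the objective value at the nearest evaluated point, continuity of $f$ is what drives the error to zero as that point approaches $x_0$.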

Next, a neural network predictor at point $x_0$ is given by:

$$\hat{f}_{\text{Net}}(x_0;\mathcal{D}_n)=f_{\phi_N,\mathbf{H}_N^*}\circ f_{\phi_{N-1},\mathbf{H}_{N-1}^*}\circ\cdots\circ f_{\phi_1,\mathbf{H}_1^*}(x_0), \tag{3}$$

where $f_{\phi_i,\mathbf{H}_i}(x'):=\phi_i(\mathbf{H}_i[x',1])$, $\mathbf{H}_1\in\mathbb{R}^{d_1\times(d+1)}$, $\mathbf{H}_i\in\mathbb{R}^{d_i\times(d_{i-1}+1)}$ for $i\geq 2$, $N$ denotes the number of layers, $\phi_i$ denotes an activation function, and $(\mathbf{H}_1^*,\ldots,\mathbf{H}_N^*):=\operatorname*{arg\,min}_{\mathbf{H}_1,\ldots,\mathbf{H}_N}\frac{1}{n}\sum_{k=1}^{n}\big(f_{\phi_N,\mathbf{H}_N}\circ f_{\phi_{N-1},\mathbf{H}_{N-1}}\circ\cdots\circ f_{\phi_1,\mathbf{H}_1}(x_k)-f(x_k)\big)^2$. Moreover, by over-parameterization we mean that for any dataset $\mathcal{D}_n$, the hyperparameters, such as the depth of the network $N$ and the hidden dimensions $\{d_i\}_{1\leq i\leq N}$, can be adjusted so that $(\mathbf{H}_1^*,\ldots,\mathbf{H}_N^*)$ can be chosen to perfectly fit the data $\mathcal{D}_n$, i.e., $\hat{f}_{\text{Net}}(x;\mathcal{D}_n)=f(x)$ for any $x\in X_n$.

Proposition 4.3 (Local consistency of over-parameterized neural network).

Assume a neural network predictor is $L$-Lipschitz and over-parameterized. Then it is locally consistent.

In fact, a closer look at the proof of Proposition 4.3 reveals that its conclusion holds as long as the SP is $L$-Lipschitz and over-parameterized. That is, we do not need the neural network structure in our argument. Nonetheless, neural networks offer a natural class of SPs where the needed properties can be attained.

Another type of locally consistent SP is the family of models that, in some sense, average the values of the surrounding evaluated points, and the prediction values at any evaluated points are not necessarily the evaluated values. This includes, for instance, regression trees and local regression.

We start with the regression tree predictor $\hat{f}_{\text{Tree}}(x_0;\mathcal{D}_n)$, where the tree structure is parameterized by some splitting rule $\theta$. The tree predictor at point $x_0$ is given by:

$$\hat{f}_{\text{Tree}}(x_0;\mathcal{D}_n)=\sum_{i=1}^{n}w_i(x_0;\theta)f(x_i), \tag{4}$$

where $w_i(x_0;\theta)=\frac{\mathbbm{1}\{x_i\in R_{l(x_0,\theta)}\}}{\sum_{j=1}^{n}\mathbbm{1}\{x_j\in R_{l(x_0,\theta)}\}}$, and $R_{l(x_0,\theta)}$ denotes the region covered by the leaf $l$, determined by both the input $x_0$ and $\theta$. If $\sum_{i=1}^{n}\mathbbm{1}\{x_i\in R_{l(x_0,\theta)}\}=0$, then we define $\hat{f}_{\text{Tree}}(x_0;\mathcal{D}_n)=\frac{1}{n}\sum_{i=1}^{n}f(x_i)$.

Proposition 4.4 ($\epsilon$-relaxed local consistency of regression tree).

Assuming $f$ is $L$-Lipschitz, the regression tree predictor (4) parameterized by splitting rule $\theta$ is $2L\cdot\max_{x\in\mathcal{X}}\text{diam}(R_{l(x,\theta)})$-relaxed locally consistent, where $\text{diam}(R)$ for a region $R$ refers to $\max_{x,y\in R}\|x-y\|$.

Proposition 4.4 stipulates that if we configure the regression tree such that all its leaves are small, then it is $\epsilon$-relaxed locally consistent with an $\epsilon$ that scales with this small leaf size and the Lipschitz constant of $f$.
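To make the role of the leaf size concrete, here is a minimal one-dimensional sketch of (4). It assumes, purely for illustration, that the splitting rule $\theta$ is a fixed sorted list of cut points on $[0,1]$, so that each leaf is an interval; the names `leaf_index` and `tree_predict` are hypothetical. When the leaf containing $x_0$ holds at least one evaluated point, every such point lies within one leaf diameter of $x_0$, so for an $L$-Lipschitz $f$ the averaged prediction is off by at most $L$ times that diameter, which is within the bound of Proposition 4.4.

```python
import bisect

def leaf_index(x, cuts):
    """Leaf l(x, theta) when theta is a sorted list of 1-D cut points."""
    return bisect.bisect_right(cuts, x)

def tree_predict(x0, X, y, cuts):
    """Equation (4) in 1-D: average the values whose inputs share x0's leaf."""
    leaf = leaf_index(x0, cuts)
    in_leaf = [yi for xi, yi in zip(X, y) if leaf_index(xi, cuts) == leaf]
    if not in_leaf:
        # Empty leaf: fall back to the global average, as in the definition above.
        return sum(y) / len(y)
    return sum(in_leaf) / len(in_leaf)

f = lambda x: abs(x - 0.5)        # hypothetical 1-Lipschitz objective (L = 1)
L = 1.0
cuts = [0.25, 0.5, 0.75]          # four leaves, each of diameter 0.25
leaf_diameter = 0.25
X = [0.05, 0.10, 0.30, 0.45, 0.90]
y = [f(x) for x in X]

pred = tree_predict(0.40, X, y, cuts)      # leaf [0.25, 0.5) holds 0.30 and 0.45
err = abs(pred - f(0.40))
fallback = tree_predict(0.60, X, y, cuts)  # leaf [0.5, 0.75) is empty
print(err, L * leaf_diameter, fallback)
```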

Next we present the local kernel regression predictor:

$$\hat{f}_{\text{KR}}(x_0;\mathcal{D}_n)=\frac{\sum_{i\leq n}K\left(\frac{\|x_0-x_i\|}{h}\right)f(x_i)}{\sum_{i\leq n}K\left(\frac{\|x_0-x_i\|}{h}\right)} \tag{5}$$

where $K$ is a non-negative function with bounded support on $[0,M]$ for some $M>0$, i.e., $K(y)=0$ for $y>M$ and $K(y)>0$ for $y\leq M$. The hyperparameter $h>0$ is the bandwidth. Definition (5) is well-defined as long as there is an $x_i$ for $i\leq n$ such that $\|x_0-x_i\|\leq Mh$. If the latter does not hold, we output a pre-specified constant, e.g., 0, as the prediction value. Note that we can modify all the above slightly if the bounded support is on $[0,M)$, i.e., when $y=M$, $K(y)=0$ instead of being positive, and all the discussion can be easily adapted accordingly. Definition (5) mimics the Nadaraya-Watson estimator. As a simple example, we can take $K(y)$ to be a positive constant for $y\in[0,M]$.

Proposition 4.5 ($\epsilon$-relaxed local consistency of local kernel regression).

Consider the local kernel regression predictor (5), where $K$ is a non-negative function with bounded support on $[0,M]$ for some $M>0$, and $h>0$ is the bandwidth. Assume $f$ is $L$-Lipschitz. Then the local kernel regression predictor is $LMh$-relaxed locally consistent.
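The $LMh$ scale can be seen directly from (5): only evaluated points within distance $Mh$ of $x_0$ receive positive weight, so for an $L$-Lipschitz $f$ the weighted average cannot differ from $f(x_0)$ by more than $LMh$ whenever at least one such point exists. The sketch below illustrates this with the constant (box) kernel mentioned above, on scalar inputs; the objective, bandwidth and data are assumptions for the example, and `box_kernel` and `kernel_regression_predict` are hypothetical names.

```python
def box_kernel(u, M=1.0):
    """K(u) = 1 on [0, M] and 0 otherwise (the constant-kernel example)."""
    return 1.0 if 0.0 <= u <= M else 0.0

def kernel_regression_predict(x0, X, y, h, kernel=box_kernel, default=0.0):
    """Equation (5) on scalar inputs, with a fixed constant as the fallback."""
    weights = [kernel(abs(x0 - xi) / h) for xi in X]
    total = sum(weights)
    if total == 0.0:
        # No evaluated point within M * h of x0: output the pre-specified constant.
        return default
    return sum(w * yi for w, yi in zip(weights, y)) / total

f = lambda x: abs(x - 0.5)   # hypothetical 1-Lipschitz objective
L, M, h = 1.0, 1.0, 0.15     # illustrative constants; h is fixed, not tuned with n
X = [0.05, 0.10, 0.30, 0.45, 0.90]
y = [f(x) for x in X]

pred = kernel_regression_predict(0.40, X, y, h)        # uses 0.30 and 0.45
err = abs(pred - f(0.40))
unexplored = kernel_regression_predict(0.70, X, y, h)  # no neighbor within M * h
print(err, L * M * h, unexplored)
```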

In typical statistical contexts, the hyperparameters $\theta$ and $h$ in regression trees and local kernel regression scale with the sample size $n$. However, here we take $\theta$ and $h$ to be fixed. As discussed before, unlike in conventional statistical contexts, we make no assumption on how $x_i$ are distributed; in fact, $x_i$ can be selected according to a completely deterministic process in the PseudoBO algorithm, and thus the standard way of tuning these hyperparameters does not apply. Instead, we fix them and aim for $\epsilon$-relaxed local consistency, i.e., we pay a small price on the convergence due to fixing these hyperparameters, in exchange for our capability to bypass the tuning issue.

Local consistency can be preserved if we apply the locally consistent SP together with a “randomized prior”. This randomized prior means we first sample from a random field to obtain a function, then we perturb the function values of the evaluated points by this randomly generated function and construct a predictor. When we output the prediction values, we compensate the prediction from the perturbed data by this function. This idea is more relevant when we discuss UQ, where the randomized prior serves as a natural artifact to quantify uncertainty. Here, we want to point out that an average over these perturbed predictions also gives rise to a valid SP.

More precisely, let $r:\mathcal{X}\to\mathbb{R}$ be a continuous function randomly generated from a generating distribution $\mathcal{R}$. Suppose a “base” predictor is $\hat{f}$. The randomized prior mean prediction is

$$\hat{f}_{\text{RP}}(x_0;\mathcal{D}_n)=\mathbb{E}_{r\sim\mathcal{R}}[\hat{f}^{(r)}(x_0;\mathcal{D}_n)], \tag{6}$$

where $\hat{f}^{(r)}(x_0;\mathcal{D}_n)=r(x_0)+\hat{f}(x_0;\mathcal{D}_n^{(r)})$ is a “perturb-then-compensate” predictor using the function $r$, and $\mathcal{D}_n^{(r)}:=\{(x_i,f(x_i)-r(x_i))\}$ is the $r$-perturbed data set.
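The perturb-then-compensate construction can be sketched as follows. This is an illustration under stated assumptions rather than the construction used in the paper's experiments: the base predictor is a one-dimensional nearest neighbor rule, the generating distribution $\mathcal{R}$ is taken to be a random sinusoid (a hypothetical choice that yields continuous functions), and the expectation in (6) is approximated by a Monte Carlo average over a finite number of draws.

```python
import math
import random
import statistics

def nn_predict(x0, X, y):
    """Base predictor: 1-D nearest neighbor."""
    i = min(range(len(X)), key=lambda j: abs(X[j] - x0))
    return y[i]

def draw_prior(rng):
    """Hypothetical generating distribution R: a random sinusoid, continuous in x."""
    a = rng.gauss(0.0, 1.0)
    b = rng.uniform(0.5, 3.0)
    c = rng.uniform(0.0, 2.0 * math.pi)
    return lambda x: a * math.sin(b * x + c)

def perturb_then_compensate(x0, X, y, r, base=nn_predict):
    """f_hat^{(r)}(x0; D_n) = r(x0) + f_hat(x0; D_n^{(r)})."""
    y_perturbed = [yi - r(xi) for xi, yi in zip(X, y)]  # the r-perturbed data set
    return r(x0) + base(x0, X, y_perturbed)

def randomized_prior_mean(x0, X, y, n_draws=200, seed=0):
    """Monte Carlo approximation of (6), plus the spread across draws."""
    rng = random.Random(seed)
    preds = [perturb_then_compensate(x0, X, y, draw_prior(rng)) for _ in range(n_draws)]
    return statistics.fmean(preds), statistics.pstdev(preds)

f = lambda x: x ** 2          # hypothetical objective
X = [0.0, 0.2, 0.9]
y = [f(x) for x in X]

mean_at_data, spread_at_data = randomized_prior_mean(X[1], X, y)
mean_in_gap, spread_in_gap = randomized_prior_mean(0.5, X, y)
print(mean_at_data, spread_at_data, mean_in_gap, spread_in_gap)
```

With an interpolating base predictor, the perturbation cancels at every evaluated point, so all draws agree there up to floating-point error; away from the data the draws disagree, which is the spread that the UQ discussion later exploits.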

Proposition 4.6 (Preservation of $\epsilon$-relaxed local consistency of randomized prior mean).

Suppose $r$ is a continuous function randomly generated from the generating distribution $\mathcal{R}$, and $\hat{f}^{(r)}$ is $\epsilon$-relaxed locally consistent in predicting $f$ almost surely in $r$. Also assume $f,\hat{f}^{(r)}$ are uniformly bounded. Then the randomized prior mean (6) is $\epsilon$-relaxed locally consistent.

Lastly, we have another useful preservation result for local consistency. Specifically, if we have a class of ($\epsilon$-relaxed) locally consistent SPs, then their convex combination, which we call a hybrid SP, will also enjoy ($\epsilon$-relaxed) local consistency.

Proposition 4.7 (Preservation of $\epsilon$-relaxed local consistency of hybrid SP).

Given a finite set of $\epsilon$-relaxed locally consistent SPs $\hat{\mathcal{F}}:=\{\hat{f}_1,\hat{f}_2,\ldots\}$, the hybrid SP $\sum_{i=1}^{|\hat{\mathcal{F}}|}\alpha_i\hat{f}_i$, where $\sum_{i=1}^{|\hat{\mathcal{F}}|}\alpha_i=1$ and $\alpha_i\geq 0$, is $\epsilon$-relaxed locally consistent.

Clearly, Propositions 4.6 and 4.7 apply to the basic notion of local consistency in Definition 3.3 as a special case when taking $\epsilon=0$. Moreover, in the latter case, Proposition 4.7 can be strengthened to allow $\alpha_i$ to be negative by using a similar proof.
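A hybrid SP is straightforward to assemble. The sketch below combines two simple one-dimensional predictors with convex weights and checks the pointwise fact that presumably underlies Proposition 4.7: since the weights are non-negative and sum to one, the error of the combination at a point is at most the weighted average of the individual errors, and hence at most the largest of them. The predictors, weights and data are illustrative assumptions, and all function names are hypothetical.

```python
def nn_predict(x0, X, y):
    """1-D nearest neighbor predictor."""
    i = min(range(len(X)), key=lambda j: abs(X[j] - x0))
    return y[i]

def box_kernel_predict(x0, X, y, h=0.15, default=0.0):
    """Local kernel regression with a constant kernel on [0, 1] and bandwidth h."""
    near = [yi for xi, yi in zip(X, y) if abs(x0 - xi) <= h]
    return sum(near) / len(near) if near else default

def hybrid_predict(x0, X, y, predictors, alphas):
    """Convex combination sum_i alpha_i * f_hat_i(x0; D_n)."""
    if any(a < 0 for a in alphas) or abs(sum(alphas) - 1.0) > 1e-12:
        raise ValueError("alphas must be non-negative and sum to 1")
    return sum(a * p(x0, X, y) for a, p in zip(alphas, predictors))

f = lambda x: abs(x - 0.5)   # hypothetical objective
X = [0.05, 0.10, 0.30, 0.45, 0.90]
y = [f(x) for x in X]
x0 = 0.40

predictors = [nn_predict, box_kernel_predict]
alphas = [0.3, 0.7]          # illustrative convex weights
errors = [abs(p(x0, X, y) - f(x0)) for p in predictors]
hybrid_error = abs(hybrid_predict(x0, X, y, predictors, alphas) - f(x0))
weighted_bound = sum(a * e for a, e in zip(alphas, errors))
print(errors, hybrid_error, weighted_bound)
```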

4.2 UQ with SNEB Property

SNEB in a UQ entails that the quantified uncertainty at the queried points decreases to 0 as evaluated points accumulate around them, while the uncertainty stays positive in any unexplored area. When verifying SNEB for a UQ that is associated with a particular SP, it is typically the case that if the SP is locally consistent and the UQ is a “truthful” representation of the uncertainty, then part 2 of the SNEB property in Assumption 3.4 holds. On the other hand, verifying part 1 of that assumption requires further conditions, as it concerns the unexplored region, which the local consistency property does not address.

We first consider the GP posterior variance (or standard deviation). This UQ, at point $x_0$, is given by:

$$\hat{\sigma}_{\text{GP}}^2(x_0;\mathcal{D}_n):=\mathcal{K}(x_0,x_0)-\mathcal{K}(x_0,X_n)\mathcal{K}(X_n,X_n)^{-1}\mathcal{K}(X_n,x_0) \tag{7}$$

which is derived through modeling $f$ as a GP with prior covariance represented via $\mathcal{K}$ [54]. Note that $\hat{\sigma}_{\text{GP}}^2(x_0;\mathcal{D}_n)$ does not depend on the prior mean function $\mu_0$ or the labels in the data as seen from (7).
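The behavior that SNEB asks for can be observed numerically in (7). The following minimal sketch assumes scalar inputs, a Gaussian kernel with an illustrative bandwidth, and a small hand-written linear solver; the function names are hypothetical. Note that the function takes only the evaluated inputs, not their labels, in line with the remark above.

```python
import math

def gaussian_kernel(x, y, h=1.0):
    """Gaussian kernel on scalars; the bandwidth h = 1.0 is an illustrative choice."""
    return math.exp(-((x - y) ** 2) / (2.0 * h ** 2))

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - tail) / M[i][i]
    return z

def gp_posterior_variance(x0, X, kernel=gaussian_kernel):
    """Equation (7): K(x0, x0) - K(x0, X) K(X, X)^{-1} K(X, x0)."""
    K = [[kernel(a, b) for b in X] for a in X]
    k0 = [kernel(x0, xi) for xi in X]
    coef = solve(K, k0)  # K(X, X)^{-1} K(X, x0)
    return kernel(x0, x0) - sum(k * c for k, c in zip(k0, coef))

X = [0.0, 0.7, 1.5, 2.4]   # hypothetical evaluated inputs; no labels are needed

var_at_data = [gp_posterior_variance(xi, X) for xi in X]
var_far_away = gp_posterior_variance(6.0, X)
print(var_at_data, var_far_away)
```

At the evaluated inputs the variance vanishes up to floating-point error, while at a point far from all of them it stays close to the prior variance $\mathcal{K}(x_0,x_0)=1$.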

Furthermore, we call a covariance function stationary if it has the property 𝒦(x,y)=𝒦(xy)𝒦𝑥𝑦𝒦𝑥𝑦\mathcal{K}(x,y)=\mathcal{K}(x-y)caligraphic_K ( italic_x , italic_y ) = caligraphic_K ( italic_x - italic_y ). That is, the function evaluations on the two inputs x𝑥xitalic_x and y𝑦yitalic_y only depend on their difference. This property is commonly imposed for modeling black-box functions, and popular examples are Gaussian covariance and Matérn covariance:

$$\mathcal{K}_{\text{Gaussian}}(x,y)=\exp\left\{-\frac{\|x-y\|^{2}}{2h^{2}}\right\}, \tag{8}$$

where the bandwidth $h$ is a positive hyperparameter, and

$$\mathcal{K}_{\text{Matern}}(x,y)=\frac{2^{1-\nu}}{\Gamma(\nu)}\left(\sqrt{2\nu}\frac{\|x-y\|}{\rho}\right)^{\nu}K_{\nu}\left(\sqrt{2\nu}\frac{\|x-y\|}{\rho}\right), \tag{9}$$

where $\Gamma$ is the gamma function, $K_{\nu}$ is the modified Bessel function of the second kind, and $\rho$, $\nu$ are positive hyperparameters for the covariance function.
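The formulas (7) to (9) can be sketched numerically as follows. This is a minimal illustration rather than the authors' implementation: the function names, the `jitter` safeguard, and the choice $\nu=5/2$ (for which the Bessel form in (9) reduces to a well-known closed form) are assumptions made here for convenience.

```python
import numpy as np

def gaussian_kernel(A, B, h=1.0):
    """Gaussian covariance (8) between the rows of A (m, d) and B (k, d)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * h ** 2))

def matern52_kernel(A, B, rho=1.0):
    """Matern covariance (9) specialized to nu = 5/2 (closed form, no Bessel call)."""
    dist = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1))
    s = np.sqrt(5.0) * dist / rho
    return (1.0 + s + s ** 2 / 3.0) * np.exp(-s)

def gp_posterior_variance(x0, X, kernel, jitter=1e-10):
    """Posterior variance (7) at one point x0 (d,), given evaluated inputs X (n, d)."""
    x0 = np.atleast_2d(x0)
    K_nn = kernel(X, X) + jitter * np.eye(len(X))  # jitter: numerical safeguard (assumption)
    k_0n = kernel(x0, X)                           # shape (1, n)
    var = kernel(x0, x0) - k_0n @ np.linalg.solve(K_nn, k_0n.T)
    return max(var.item(), 0.0)                    # clip tiny negative round-off

X = np.array([[0.0], [1.0], [2.0]])
var_at_data = gp_posterior_variance(np.array([1.0]), X, gaussian_kernel)
var_far_away = gp_posterior_variance(np.array([10.0]), X, gaussian_kernel)
```

No labels enter `gp_posterior_variance`, which mirrors the remark after (7): the variance is driven only by where evaluations were made. It is essentially zero at an evaluated input and close to the prior variance far from all of them.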

Finally, the spectral density of a GP characterizes its covariance in the frequency domain. For a stationary GP with covariance function $\mathcal{K}(\tau)$, where $\tau:=x-y$, the spectral density $\mathcal{S}(\omega)$ is defined as the Fourier transform of the covariance function:

$$\mathcal{S}(\omega)=\int_{-\infty}^{\infty}\mathcal{K}(\tau)\exp(-i\omega\tau)\,\text{d}\tau.$$

With all the above, we have the following guarantee:

Proposition 4.8 (SNEB of GP posterior standard deviation).

Suppose the assumption of Proposition 4.1 holds and, additionally, the GP is stationary with spectral density $\mathcal{S}$ such that $\mathcal{S}^{-1}$ has at most polynomial growth. Then the GP posterior standard deviation $\hat{\sigma}_{\text{GP}}(x_{0};\mathcal{D}_{n})$ is SNEB.
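As a worked illustration of the growth condition, consider dimension one with the Fourier transform convention used in the definition of $\mathcal{S}$. The closed forms below are standard Fourier-transform facts rather than statements taken from this paper, and the normalizing constant of the Matérn case is suppressed:

```latex
\[
\mathcal{S}_{\text{Gaussian}}(\omega)=\sqrt{2\pi}\,h\,\exp\left(-\frac{h^{2}\omega^{2}}{2}\right),
\qquad
\mathcal{S}_{\text{Matern}}(\omega)\propto\left(\frac{2\nu}{\rho^{2}}+\omega^{2}\right)^{-(\nu+1/2)}.
\]
```

Hence $\mathcal{S}_{\text{Matern}}^{-1}$ grows like $|\omega|^{2\nu+1}$, which is polynomial, whereas $\mathcal{S}_{\text{Gaussian}}^{-1}$ grows like $\exp(h^{2}\omega^{2}/2)$. Read literally, the condition therefore appears to cover Matérn covariances (9) but not the Gaussian covariance (8).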

Proposition 4.8 largely follows from the seminal GP result in [68]. Next, we also have that the minimum distance UQ is SNEB.

Proposition 4.9 (SNEB of minimum distance).

$\Delta$ is SNEB.
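For concreteness, the minimum distance UQ can be sketched as below. The definition of $\Delta$ lies outside this section, so the sketch assumes it is the Euclidean distance from the test point to the nearest evaluated input; the function name is hypothetical.

```python
import numpy as np

def min_distance_uq(x0, X):
    """Euclidean distance from x0 (d,) to the nearest evaluated input in X (n, d)."""
    return float(np.min(np.linalg.norm(X - x0, axis=1)))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
uq_at_evaluated_point = min_distance_uq(np.array([1.0, 0.0]), X)  # zero at an evaluated input
uq_at_new_point = min_distance_uq(np.array([3.0, 4.0]), X)        # positive away from the data
```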

Next we present a UQ utilizing randomized prior, which we have briefly discussed in the SP examples in Section 4.1. This method first randomly generates a continuous function $r$ from a generating distribution $\mathcal{R}$. We fit the SP $\hat{f}^{(r)}(x_{0};\mathcal{D}_{n})=r(x_{0})+\hat{f}(x_{0};\mathcal{D}^{(r)}_{n})$, where $\hat{f}$ is a base SP and $\mathcal{D}^{(r)}_{n}$ is a data set perturbed by $r$, i.e., $\mathcal{D}_{n}^{(r)}:=(x_{i},f(x_{i})-r(x_{i}))_{i=1}^{n}$.
Upon repeating the sampling of $r$ many times, the UQ can be computed as $(\operatorname{Var}_{r\sim\mathcal{R}}[\hat{f}^{(r)}(x_{0};\mathcal{D}_{n})])^{1/2}$.

Here is the rationale of this randomized prior UQ. Suppose a test point $x_{0}$ is surrounded by many evaluated points. Then, if $f$ is continuous and $\hat{f}$ is locally consistent, we have $\hat{f}(x_{0};\mathcal{D}^{(r)}_{n})$ roughly equal to $f(x_{0})-r(x_{0})$, so that $\hat{f}^{(r)}(x_{0};\mathcal{D}_{n})$ becomes roughly $f(x_{0})$ by canceling out the $r(x_{0})$. Thus, $\operatorname{Var}_{r\sim\mathcal{R}}[\hat{f}^{(r)}(x_{0};\mathcal{D}_{n})]$ becomes close to zero. This signals a low uncertainty at $x_{0}$.
On the other hand, suppose $x_{0}$ has no surrounding evaluated points, or in other words the region surrounding $x_{0}$ is sparsely evaluated. Then, $\hat{f}(x_{0};\mathcal{D}^{(r)}_{n})$ would have weak reliance on $r$. For example, $\hat{f}$ could just output a constant for $x_{0}$ when there are no neighboring evaluated points. In this case, $\hat{f}^{(r)}(x_{0};\mathcal{D}_{n})$ will equal $r(x_{0})$ up to that constant, which has a high variance. This signals a high uncertainty at $x_{0}$. This “perturb-then-compensate” principle of randomized prior thus naturally distinguishes between densely and sparsely evaluated regions.
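The perturb-then-compensate mechanism described above can be sketched as follows. Everything specific here is an assumption made for illustration: the base SP is a simple average of labels within a bandwidth `h` (returning the constant 0 when no neighbor exists), and the generating distribution $\mathcal{R}$ is taken to produce random cosine functions. Neither choice is claimed to be the one used in the paper.

```python
import numpy as np

def base_sp(x0, X, y, h):
    """Hypothetical base SP: average the labels within distance h of x0, or 0 if none."""
    near = np.linalg.norm(X - x0, axis=1) <= h
    return float(y[near].mean()) if near.any() else 0.0

def sample_prior_function(rng, d):
    """Draw r(x) = a * cos(w . x + b); a hypothetical choice of the distribution R."""
    a, w, b = rng.normal(), rng.normal(size=d), rng.uniform(0.0, 2.0 * np.pi)
    return lambda Z: a * np.cos(Z @ w + b)

def randomized_prior_uq(x0, X, y, h=0.5, n_draws=500, seed=0):
    """Monte Carlo estimate of the std over r of r(x0) + f_hat(x0; D_n^(r))."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_draws):
        r = sample_prior_function(rng, X.shape[1])
        y_perturbed = y - r(X)                                # perturb the labels by r
        preds.append(r(x0) + base_sp(x0, X, y_perturbed, h))  # compensate with r(x0)
    return float(np.std(preds))

def f(Z):
    """Toy objective used only for this illustration."""
    return np.sin(3.0 * Z[:, 0])

X = np.array([[0.0], [1.0], [2.0]])
y = f(X)
uq_dense = randomized_prior_uq(np.array([1.0]), X, y)   # x0 coincides with an evaluated point
uq_sparse = randomized_prior_uq(np.array([8.0]), X, y)  # x0 has no neighbor within h
```

At the evaluated point the perturbation cancels and the estimate is numerically zero, while at the isolated point the prediction reduces to $r(x_{0})$ and inherits its spread.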

As far as we know, the concept of randomized prior appears initially in [48] to quantify uncertainty in deep reinforcement learning. Compared to the bootstrap [19], this approach appears procedurally similar but is conceptually different. In particular, randomized prior uses the “perturb-then-compensate” principle to signal high uncertainty for unexplored regions, while the bootstrap is designed to quantify the statistical errors arising from data randomness. The latter arguably works only when data are sufficiently abundant, since its underlying resampling can accurately resemble the sampling distribution only in such situations. Moreover, the uncertainty it captures comes from data randomness, while in the PseudoBO setting there is no data randomness and the uncertainty comes purely from the lack of knowledge on the objective function. Finally, we mention that [48] proposes randomized prior with neural networks; however, we generalize it here to broader regressors and can pair it with any other locally consistent SPs.

Finally, if the SP is $\epsilon$-relaxed locally consistent instead of exactly locally consistent, we can use

$$\left(\left(\operatorname{Var}_{r\sim\mathcal{R}}[\hat{f}^{(r)}(x_{0};\mathcal{D}_{n})]\right)^{1/2}-\epsilon\right)_{+} \tag{10}$$

for any point $x_{0}$, where $(y)_{+}$ is the positive part function, i.e., $(y)_{+}=y$ if $y>0$ and $(y)_{+}=0$ otherwise. This definition allows for the $\epsilon$ prediction error, so that when the prediction uncertainty signified by $(\operatorname{Var}_{r\sim\mathcal{R}}[\hat{f}^{(r)}(x_{0};\mathcal{D}_{n})])^{1/2}$ is less than $\epsilon$, we take the UQ as 0 in the PseudoBO procedure. Correspondingly, the UQ is positive only if the test point $x_{0}$ is sufficiently far away from all evaluated points, giving rise to a $\delta$-relaxed notion of SNEB.
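The thresholding in (10) amounts to a single positive-part operation. A minimal sketch, in which the function name and the numerical values are purely illustrative:

```python
import numpy as np

def relaxed_uq(raw_std, eps):
    """Positive part of (raw_std - eps), as in (10); accepts scalars or arrays."""
    return np.maximum(np.asarray(raw_std, dtype=float) - eps, 0.0)

raw = np.array([0.02, 0.10, 0.75])   # hypothetical randomized prior standard deviations
relaxed = relaxed_uq(raw, eps=0.10)  # the first two entries are treated as "no uncertainty"
```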

Proposition 4.10 ($\delta$-relaxed SNEB of randomized prior standard deviation).

Suppose the assumptions of Proposition 4.6 hold and that $(\operatorname{Var}_{r\sim\mathcal{R}}[\hat{f}^{(r)}(x_{0};\mathcal{D}_{n})])^{1/2}>\epsilon$ whenever $\inf_{n}\Delta(x_{0};\mathcal{D}_{n})>\delta$ for any $x_{0}$ and $\mathcal{D}_{n}$, for some fixed $\delta>0$. Then the UQ (10) is $\delta$-relaxed SNEB.

We can apply the randomized prior variance to the local regression SP defined in (5), giving the following:

Corollary 4.11 ($\delta$-relaxed SNEB of randomized prior standard deviation for local regression).

Under the same assumptions as Proposition 4.5, suppose $r$ is an $\tilde{L}$-Lipschitz and uniformly bounded function randomly generated from the generating distribution $\mathcal{R}$, with $(\operatorname{Var}_{r\sim\mathcal{R}}[r(x)])^{1/2}>(L+\tilde{L})Mh$ for any $x\in\mathcal{X}$. Also assume $f$ is bounded. Then the UQ $((\operatorname{Var}_{r\sim\mathcal{R}}[\hat{f}^{(r)}(x_{0};\mathcal{D}_{n})])^{1/2}-(L+\tilde{L})Mh)_{+}$ for any point $x_{0}$, with base SP (5), is $Mh$-relaxed SNEB.

Finally, like our discussion of SP, we also have the preservation of ($\delta$-relaxed) SNEB for a combination of UQs, which we call hybrid UQ:

Proposition 4.12 (Preservation of $\delta$-relaxed SNEB of hybrid UQ).

Given a finite set of $\delta$-relaxed SNEB UQs $\hat{\Sigma}:=\{\hat{\sigma}_{1},\hat{\sigma}_{2},\ldots\}$, the hybrid UQ $\sum_{i=1}^{|\hat{\Sigma}|}\alpha_{i}\hat{\sigma}_{i}$, where $\sum_{i=1}^{|\hat{\Sigma}|}\alpha_{i}=1$ and $\alpha_{i}\geq 0$, is $\delta$-relaxed SNEB.

Similar to our discussion of SP in Section 4.1, Proposition 4.12 reduces to the case of exact SNEB when $\delta=0$. On the other hand, unlike there, we can readily see from the proof of Proposition 4.12 that we can relax the convex combination of UQs to any linear combination of UQs with weights $\alpha_{i}\geq 0$ for all $i$ with at least one of the weights being strictly positive. Nonetheless, from a practical perspective, scaling the UQs by a multiplicative factor does not change the PseudoBO performance as long as we suitably rescale the AF.
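A hybrid UQ at a given test point is then just a nonnegatively weighted sum of the individual UQ values. The sketch below uses a hypothetical helper name and made-up UQ values; it enforces only the relaxed weight condition discussed above (nonnegative weights, at least one strictly positive).

```python
import numpy as np

def hybrid_uq(uq_values, weights):
    """Weighted sum of UQ values at one test point; weights nonnegative, not all zero."""
    uq_values = np.asarray(uq_values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or not np.any(weights > 0):
        raise ValueError("weights must be nonnegative with at least one positive entry")
    return float(weights @ uq_values)

# Hypothetical UQ values at a single test point, e.g. a GP std and a minimum distance.
convex = hybrid_uq([0.4, 1.2], weights=[0.5, 0.5])
rescaled = hybrid_uq([0.4, 1.2], weights=[1.0, 1.0])  # twice the convex combination
```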

4.3 AF with Improvement Property

The improvement property of AF, which signifies a zero EW for points that certainly lead to no improvement and a positive EW for uncertain points, is satisfied by classical criteria such as probability of improvement (PI) and expected improvement (EI). These criteria can be written as

$$g_{n}^{\text{PI}}(p_{n},q_{n})=\begin{cases}\Phi\left(\frac{p_{n}-\tau}{q_{n}}\right),&\text{if}\ q_{n}>0,\\ \mathbbm{1}\{p_{n}-\tau>0\},&\text{if}\ q_{n}=0.\end{cases} \tag{11}$$

for PI, and

$$g_{n}^{\text{EI}}(p_{n},q_{n})=\begin{cases}q_{n}\phi\left(\frac{p_{n}-\tau}{q_{n}}\right)+(p_{n}-\tau)\Phi\left(\frac{p_{n}-\tau}{q_{n}}\right),&\text{if}\ q_{n}>0,\\ \max\{p_{n}-\tau,0\},&\text{if}\ q_{n}=0.\end{cases} \tag{12}$$

for EI, where $\phi(\cdot)$ and $\Phi(\cdot)$ denote the standard normal density and distribution functions, and $\tau\geq 0$ is a small hyperparameter to ensure smoothness at the “boundary” between $q_{n}=0$ and $q_{n}>0$, especially in the PI case where we will set $\tau>0$. Typically, the potential improvement $p_{n}$ is defined as $\hat{f}(x;\mathcal{D}_{n})-\max\Pi_{f}(\mathcal{D}_{n})$, which is the value of the SP $\hat{f}$ above the best evaluated objective value $\max\Pi_{f}(\mathcal{D}_{n})$, and the uncertainty estimate $q_{n}$ is taken as our UQ $\hat{\sigma}(x;\mathcal{D}_{n})$. With these, (11) and (12) give

$$\text{PI}(x;\mathcal{D}_{n})=\begin{cases}\Phi\left(\frac{\hat{f}(x;\mathcal{D}_{n})-\max\Pi_{f}(\mathcal{D}_{n})-\tau}{\hat{\sigma}(x;\mathcal{D}_{n})}\right),&\text{if}\ \hat{\sigma}(x;\mathcal{D}_{n})>0,\\ \mathbbm{1}\{\hat{f}(x;\mathcal{D}_{n})-\max\Pi_{f}(\mathcal{D}_{n})-\tau>0\},&\text{if}\ \hat{\sigma}(x;\mathcal{D}_{n})=0.\end{cases} \tag{13}$$

and

$$\text{EI}(x;\mathcal{D}_{n})=\begin{cases}\hat{\sigma}(x;\mathcal{D}_{n})\phi\left(\frac{\hat{f}(x;\mathcal{D}_{n})-\max\Pi_{f}(\mathcal{D}_{n})-\tau}{\hat{\sigma}(x;\mathcal{D}_{n})}\right)+(\hat{f}(x;\mathcal{D}_{n})-\max\Pi_{f}(\mathcal{D}_{n})-\tau)\Phi\left(\frac{\hat{f}(x;\mathcal{D}_{n})-\max\Pi_{f}(\mathcal{D}_{n})-\tau}{\hat{\sigma}(x;\mathcal{D}_{n})}\right),&\text{if}\ \hat{\sigma}(x;\mathcal{D}_{n})>0,\\ \max\{\hat{f}(x;\mathcal{D}_{n})-\max\Pi_{f}(\mathcal{D}_{n})-\tau,0\},&\text{if}\ \hat{\sigma}(x;\mathcal{D}_{n})=0.\end{cases} \tag{14}$$

The above definitions are precisely $P(Z>\tau)$ and $E[(Z-\tau)_{+}]$ respectively, where $Z$ denotes a normal variable with mean $p_{n}=\hat{f}(x;\mathcal{D}_{n})-\max\Pi_{f}(\mathcal{D}_{n})$ and standard deviation $q_{n}=\hat{\sigma}(x;\mathcal{D}_{n})$. To facilitate discussion, let $\tau=0$ for now. In the BO literature, $\hat{f}(x;\mathcal{D}_{n})$ corresponds to the posterior mean and $\hat{\sigma}(x;\mathcal{D}_{n})$ the posterior standard deviation, so that $P(Z>0)$ is the posterior probability under the GP model that a point $x$ has a higher objective value than $\max\Pi_{f}(\mathcal{D}_{n})$, and $E[Z_{+}]$ is the expectation of this excess objective value.
The following verifies that these classical PI and EI criteria satisfy our improvement property:

Proposition 4.13 (Improvement Property of PI).

With $\tau>0$, PI defined as (11) has the improvement property.

Proposition 4.14 (Improvement Property of EI).

With $\tau\geq 0$, EI defined as (12) has the improvement property.
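The two criteria (11) and (12) can be sketched directly from their case definitions. The function names and the default value of `tau` for PI are illustrative assumptions; the check at the end only illustrates the informal reading of the improvement property given in this section (zero value for a point with no uncertainty and no improvement, positive value once uncertainty is positive), not its formal definition.

```python
import math

def _pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def pi_af(p, q, tau=0.01):
    """Probability of improvement (11) for potential improvement p and uncertainty q >= 0."""
    if q > 0:
        return _cdf((p - tau) / q)
    return 1.0 if p - tau > 0 else 0.0

def ei_af(p, q, tau=0.0):
    """Expected improvement (12) for potential improvement p and uncertainty q >= 0."""
    if q > 0:
        z = (p - tau) / q
        return q * _pdf(z) + (p - tau) * _cdf(z)
    return max(p - tau, 0.0)

# A point with no uncertainty and no predicted improvement: both criteria vanish.
pi_known, ei_known = pi_af(0.0, 0.0), ei_af(0.0, 0.0)
# The same predicted value with positive uncertainty: both criteria become positive.
pi_unknown, ei_unknown = pi_af(0.0, 1.0), ei_af(0.0, 1.0)
```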

Another important acquisition approach is the upper confidence bound (UCB) widely used in online learning, defined as $\text{UCB}(x;\mathcal{D}_{n})=\hat{f}(x;\mathcal{D}_{n})+\beta_{n}\hat{\sigma}(x;\mathcal{D}_{n})$, where $\beta_{n}$ is a step-dependent positive hyperparameter that trades off exploitation and exploration. To connect to our AF notion, consider

$$g_{n}^{\text{UCB}}(p_{n},q_{n})=\frac{p_{n}-\tau}{\beta_{n}}+q_{n} \tag{15}$$

where $\tau\geq 0$ is again a hyperparameter like PI and EI described earlier (in fact, we can merely set $\tau=0$ in this case). Plugging in $p_n=\hat{f}(x;\mathcal{D}_n)-\max\Pi_f(\mathcal{D}_n)$ and $q_n=\hat{\sigma}(x;\mathcal{D}_n)$, we can see that (15) is a rescaling of the more familiar form $\text{UCB}(x;\mathcal{D}_n)$ and thus the maximizer is retained. More precisely, we have

$$\operatorname*{arg\,max}_{x\in\mathcal{X}}\ \hat{f}(x;\mathcal{D}_n)+\beta_n\hat{\sigma}(x;\mathcal{D}_n)=\operatorname*{arg\,max}_{x\in\mathcal{X}}\ \big(\hat{f}(x;\mathcal{D}_n)-\max\Pi_f(\mathcal{D}_n)-\tau\big)+\beta_n\hat{\sigma}(x;\mathcal{D}_n)=\operatorname*{arg\,max}_{x\in\mathcal{X}}\ \frac{p_n-\tau}{\beta_n}+q_n,$$

where $p_n=\hat{f}(x;\mathcal{D}_n)-\max\Pi_f(\mathcal{D}_n)$ and $q_n=\hat{\sigma}(x;\mathcal{D}_n)$.
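The claim that the rescaling leaves the maximizer unchanged can be checked numerically. The following is a minimal sketch on a 1D grid; the surrogate `f_hat`, the uncertainty `sigma_hat`, the incumbent value `best`, and the constants `beta` and `tau` are all illustrative assumptions rather than quantities from the paper.

```python
import numpy as np

def ucb_standard(f_hat, sigma_hat, beta):
    """Familiar UCB form: f_hat + beta * sigma_hat."""
    return f_hat + beta * sigma_hat

def ucb_rescaled(f_hat, sigma_hat, beta, best, tau=0.0):
    """Rescaled form (p_n - tau) / beta_n + q_n,
    with p_n = f_hat - best and q_n = sigma_hat."""
    p = f_hat - best
    q = sigma_hat
    return (p - tau) / beta + q

# Hypothetical surrogate and uncertainty evaluated on a grid (illustration only).
x = np.linspace(0.0, 1.0, 201)
f_hat = np.sin(6.0 * x)
sigma_hat = 0.3 + 0.5 * np.abs(x - 0.4)
beta, best, tau = 2.0, 0.7, 0.1  # beta must be strictly positive

standard = ucb_standard(f_hat, sigma_hat, beta)
rescaled = ucb_rescaled(f_hat, sigma_hat, beta, best, tau)

# The two forms differ by a positive factor and a constant shift in x,
# so they should select the same grid point.
i_std = int(np.argmax(standard))
i_res = int(np.argmax(rescaled))
```

The rescaled form equals the standard form divided by $\beta_n$ and shifted by a term that does not depend on $x$, which is why a strictly positive $\beta_n$ is needed for the two maximizers to coincide.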

Proposition 4.15 (Improvement Property of UCB).

Suppose $\tau$ is fixed, $\beta_n\geq 0$ is a sequence that goes to $\infty$ as $n\to\infty$, and $p_n$ is bounded. Then UCB defined as (15) has the improvement property.

Note that we have assumed the sequence $p_n$ is bounded in Proposition 4.15. When the objective function $f$ and SP $\hat{f}$ are bounded, then $p_n$ when plugged in as $\hat{f}(x;\mathcal{D}_n)-\max\Pi_f(\mathcal{D}_n)$ would be bounded as well, so that this additional assumption is readily achieved.

Finally, like SP and UQ, the improvement property is also preserved if we consider a combination of AFs, which we call hybrid AF:

Proposition 4.16 (Improvement property of hybrid AF).

Given a finite set of AFs $\mathcal{G}:=\{g_1,g_2,\ldots\}$ with the improvement property, the hybrid AF $\sum_{i=1}^{|\mathcal{G}|}\alpha_i g_i$, where $\sum_{i=1}^{|\mathcal{G}|}\alpha_i=1$ and $\alpha_i\geq 0$, has the improvement property.

Similar to hybrid UQ, we can strengthen Proposition 4.16 by relaxing the convex combination to any linear combination of AFs with weights $\alpha_i\geq 0$ for all $i$, where at least one of these weights is strictly positive. Nonetheless, since a rescaling does not affect the maximizer of the AF, considering only convex combinations is without loss of generality.

5 From Theory to Implementation

Our next goal is to leverage the PseudoBO recipe to construct empirically superior algorithms. We consider the following in choosing our ingredients for implementation:

Measuring UQ Quality. To look for a good UQ, we propose a criterion called calibrated coverage rate (CCR), which considers the quality of both SP and UQ. Given a training set $\mathcal{D}_{\text{train}}:=(X_{\text{train}},Y_{\text{train}})$, a validation set $\mathcal{D}_{\text{val}}:=(X_{\text{val}},Y_{\text{val}})$, and a test set $\mathcal{D}_{\text{test}}:=(X_{\text{test}},Y_{\text{test}})$, suppose we have a pretrained SP $\hat{f}(\cdot;\mathcal{D}_{\text{train}})$ and UQ $\hat{\sigma}(\cdot;\mathcal{D}_{\text{train}})$. We compute the test-set coverage rate of a “prediction interval” $[\hat{f}(x;\mathcal{D}_{\text{train}})-\lambda_{\text{val}}\hat{\sigma}(x;\mathcal{D}_{\text{train}}),\ \hat{f}(x;\mathcal{D}_{\text{train}})+\lambda_{\text{val}}\hat{\sigma}(x;\mathcal{D}_{\text{train}})]$ by

$$\mathbb{P}_{(x,y)\sim\mathcal{D}_{\text{test}}}\Big(y\in\big[\hat{f}(x;\mathcal{D}_{\text{train}})-\lambda_{\text{val}}\hat{\sigma}(x;\mathcal{D}_{\text{train}}),\ \hat{f}(x;\mathcal{D}_{\text{train}})+\lambda_{\text{val}}\hat{\sigma}(x;\mathcal{D}_{\text{train}})\big]\Big),$$

where $\lambda_{\text{val}}$ is the smallest $\lambda\geq 0$ such that $\mathbb{P}_{(x,y)\sim\mathcal{D}_{\text{val}}}\big(y\in[\hat{f}(x;\mathcal{D}_{\text{train}})-\lambda\hat{\sigma}(x;\mathcal{D}_{\text{train}}),\ \hat{f}(x;\mathcal{D}_{\text{train}})+\lambda\hat{\sigma}(x;\mathcal{D}_{\text{train}})]\big)=1$, i.e., the prediction interval is constructed such that it is just wide enough to perfectly cover the validation set. Intuitively, the higher the coverage on the test set, the better the SP+UQ pair.
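The CCR computation can be sketched in a few lines. The sketch below assumes the SP and UQ have already been evaluated on the validation and test inputs and that the UQ is strictly positive there; the function names are hypothetical. It uses the closed form for the calibration multiplier (the largest standardized validation residual), which is one simple way to obtain the smallest multiplier that covers the validation set.

```python
import numpy as np

def calibrate_lambda(y_val, mean_val, sigma_val):
    """Smallest lambda >= 0 such that every validation label lies in
    [mean - lambda * sigma, mean + lambda * sigma].
    Assumes sigma_val > 0 at every validation point."""
    y_val, mean_val, sigma_val = map(np.asarray, (y_val, mean_val, sigma_val))
    return float(np.max(np.abs(y_val - mean_val) / sigma_val))

def calibrated_coverage_rate(y_test, mean_test, sigma_test, lam):
    """Fraction of test labels inside the calibrated prediction interval."""
    y_test, mean_test, sigma_test = map(np.asarray, (y_test, mean_test, sigma_test))
    lower = mean_test - lam * sigma_test
    upper = mean_test + lam * sigma_test
    return float(np.mean((y_test >= lower) & (y_test <= upper)))

# Toy illustration with made-up SP/UQ values.
lam = calibrate_lambda([2.0, -1.0], [0.0, 0.0], [1.0, 2.0])
ccr = calibrated_coverage_rate([1.0, 3.0, -5.0, 0.0],
                               [0.0, 0.0, 0.0, 0.0],
                               [1.0, 1.0, 2.0, 1.0], lam)
```

A UQ that is too small near unexplored regions forces a large multiplier on the validation set, so the multiplier and the resulting interval width are worth inspecting alongside the coverage itself.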

Model Configuration. We configure the following models: PseudoBO with SP and UQ as randomized prior mean and standard deviation (PseudoBO-RP) with the base SP as local kernel regression, and PseudoBO with SP and UQ as local kernel regression and a hybrid of minimum distance and randomized prior standard deviation (PseudoBO-KR-Hyb), defined as:

$$\sigma_{\text{Hyb}}(x;\mathcal{D}_n)=\alpha_n\Delta(x,\mathcal{D}_n)+(1-\alpha_n)\sigma_{\text{RP}}(x,\tilde{\mathcal{D}}_n),\quad\text{with}\quad\alpha_n:=e^{-\Delta(x;X_n)\cdot n},\tag{16}$$

where $\tilde{\mathcal{D}}_n$ denotes the bootstrapped data. The intuition behind this UQ is that randomized prior and bootstrapping together can best quantify the local uncertainty. However, randomized prior with bootstrapping is not guaranteed to have the SNEB property. The latter can be achieved via a hybrid version that combines it with the minimum-distance UQ, where $\alpha_n\rightarrow 1$ as $x$ gets closer to $X_n$. Lastly, we also try the combination of trust region and PseudoBO-KR-Hyb, abbreviated as PseudoBO-KR-Hyb-TR.
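A minimal sketch of the hybrid UQ in (16) follows. It assumes that the minimum distance is the Euclidean distance to the nearest evaluated point, and it treats the randomized prior standard deviation as a user-supplied callable `sigma_rp` (a hypothetical placeholder; fitting the bootstrapped randomized prior ensemble is outside the scope of this sketch).

```python
import numpy as np

def min_distance(x, X):
    """Distance from x to its nearest evaluated point in X (Euclidean norm assumed)."""
    return float(np.min(np.linalg.norm(np.asarray(X) - np.asarray(x), axis=1)))

def hybrid_uq(x, X, sigma_rp):
    """Hybrid UQ: alpha * min_distance + (1 - alpha) * sigma_rp(x),
    with alpha = exp(-min_distance * n) and n the number of evaluated points."""
    n = np.asarray(X).shape[0]
    delta = min_distance(x, X)
    alpha = np.exp(-delta * n)
    return float(alpha * delta + (1.0 - alpha) * sigma_rp(x))

# Toy illustration: two evaluated points and a constant placeholder for sigma_RP.
X = np.array([[0.0, 0.0], [1.0, 0.0]])
sigma_rp = lambda x: 2.0
at_data_point = hybrid_uq(np.array([0.0, 0.0]), X, sigma_rp)
between_points = hybrid_uq(np.array([0.5, 0.0]), X, sigma_rp)
```

At an evaluated point the distance is zero, so the weight is one and the hybrid UQ vanishes regardless of the randomized prior term; away from the data, and as $n$ grows, the weight decays and the randomized prior term dominates.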

Optimizing EW. From our empirical investigation, we recommend generating a scrambled Sobol sequence and picking the point with the largest EW in the sequence. This appears to enhance performance compared with random search or BFGS.
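The candidate-based maximization described above can be sketched with SciPy's quasi-Monte Carlo module. The acquisition callable `acq`, the box bounds, and the number of candidates are assumptions made for illustration; the paper does not specify them here.

```python
import numpy as np
from scipy.stats import qmc

def maximize_over_sobol(acq, lower, upper, m=10, seed=0):
    """Evaluate acq on 2**m scrambled Sobol points in the box [lower, upper]
    and return the candidate with the largest value together with that value."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    sampler = qmc.Sobol(d=lower.size, scramble=True, seed=seed)
    unit_points = sampler.random_base2(m=m)          # points in [0, 1)^d
    candidates = qmc.scale(unit_points, lower, upper)  # map to the search box
    values = np.array([acq(c) for c in candidates])
    best = int(np.argmax(values))
    return candidates[best], float(values[best])

# Toy acquisition with a known maximizer at (0.3, 0.3), for illustration only.
toy_acq = lambda x: -float(np.sum((x - 0.3) ** 2))
x_best, v_best = maximize_over_sobol(toy_acq, [0.0, 0.0], [1.0, 1.0], m=8)
```

Drawing a power-of-two number of points via `random_base2` keeps the balance properties of the Sobol sequence, which is the usual reason to prefer it over an arbitrary sample size.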

6 Empirical Evaluations

In this section, we perform extensive empirical experiments on a wide range of tasks: a toy example for testing the quality of UQ, 4 synthetic black-box function optimization problems, 4 hyperparameter tuning tasks, and 2 robotic tasks. All these tasks are challenging and contain many local minima. The baseline models we consider in all the tasks comprise the standard BO, Random Search (RS) [8], SMAC [29], TPE [7], BORE [65], and LFBO [61]. In particular, BO is implemented using BoTorch [4], and SMAC is implemented using AutoML [38]. In the large-scale robotic experiments in Sections 6.4 and 6.5, we also include TuRBO for comparison.

For AF, the benchmarks BO, SMAC, and our PseudoBO variants use EI; TPE and BORE are designed to model PI; LFBO models EI; TuRBO is designed optimally with Thompson sampling. Additional details regarding our experiments can be found in Appendix 9.

6.1 Calibrating SP+UQ

We start by comparing the quality of different SP+UQ pairs via their CCRs: GP posterior mean + standard deviation, neural network (NN) + minimum distance (MD), randomized prior (RP) mean + standard deviation with base SP as local kernel regression (KR), and KR + Hybrid. We test on three one-dimensional (1D) benchmark functions: $f_1(x)=(\sin(\pi w))^2+(w-1)^2\cdot(1+\sin(2\pi w)^2)$, where $w:=1+(x-1)/4$, $x\in[-10,10]$; $f_2(x)=-20e^{-0.2|x|}-e^{\cos(2\pi x)}+20-e$, $x\in[-10,5]$; $f_3(x)=\sin(10\cdot\pi x)/(2x)+(x-1)^4$, $x\in[0.5,2.5]$.
We generate the training set $\mathcal{D}_{\text{train}}$, the validation set $\mathcal{D}_{\text{val}}$ and the test set $\mathcal{D}_{\text{test}}$ by uniformly sampling 20, 10 and 150 points from the decision space at random, with their labels evaluated by the objective function. The bisection method is used to find $\lambda_{\text{val}}$.
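The data generation and the bisection search for the calibration multiplier can be sketched as follows, using $f_3$ as the example objective. The random seed, the bracketing strategy, and the tolerance are assumptions made for illustration, and the sketch assumes a strictly positive UQ on the validation points.

```python
import numpy as np

def f3(x):
    """Third 1D benchmark from the text, defined on [0.5, 2.5]."""
    return np.sin(10.0 * np.pi * x) / (2.0 * x) + (x - 1.0) ** 4

def sample_split(f, low, high, sizes=(20, 10, 150), seed=0):
    """Uniformly sample train/validation/test inputs and label them with f."""
    rng = np.random.default_rng(seed)
    return [(x, f(x)) for x in (rng.uniform(low, high, size=s) for s in sizes)]

def coverage(y, mean, sigma, lam):
    """Fraction of labels within lam * sigma of the SP prediction."""
    return float(np.mean(np.abs(y - mean) <= lam * sigma))

def bisect_lambda(y_val, mean_val, sigma_val, tol=1e-8, max_iter=200):
    """Approximately the smallest lambda >= 0 with full validation coverage.
    Assumes sigma_val > 0 at every validation point."""
    lo, hi = 0.0, 1.0
    # Grow the upper bracket until the validation set is fully covered.
    while coverage(y_val, mean_val, sigma_val, hi) < 1.0:
        hi *= 2.0
    for _ in range(max_iter):
        if hi - lo <= tol:
            break
        mid = 0.5 * (lo + hi)
        if coverage(y_val, mean_val, sigma_val, mid) < 1.0:
            lo = mid   # mid is too narrow
        else:
            hi = mid   # mid already covers everything
    return hi          # hi always gives full coverage

train, val, test = sample_split(f3, 0.5, 2.5)
lam = bisect_lambda(np.array([2.0, -1.0]), np.zeros(2), np.array([1.0, 2.0]))
```

Returning the upper end of the bracket guarantees that the validation set is fully covered at the reported multiplier, at the cost of overshooting the exact minimum by at most the tolerance.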

Fig. 2: A sample run of GP, NN+MD, RP, KR+Hybrid (from left to right) to model the SP (the solid line) and the associated UQ (the shaded area). GP has CCR 0.85 and width 1.00, NN + MD has CCR 1.0 and width 1.89, RP has CCR 0.9 and width 12.73, and KR + Hybrid has CCR 0.98 and width 1.55.
| SP+UQ pair | $f_1$ CCR | $f_1$ width | $f_2$ CCR | $f_2$ width | $f_3$ CCR | $f_3$ width |
|---|---|---|---|---|---|---|
| GP | $0.91(\pm 0.12)$ | $5.09(\pm 2.38)$ | $\mathbf{0.93(\pm 0.04)}$ | $6.29(\pm 1.15)$ | $0.94(\pm 0.04)$ | $1.57(\pm 0.19)$ |
| NN + MD | $0.92(\pm 0.06)$ | $5.37(\pm 1.90)$ | $0.91(\pm 0.07)$ | $4.11(\pm 0.93)$ | $\mathbf{0.96(\pm 0.03)}$ | $1.59(\pm 0.04)$ |
| RP | $0.88(\pm 0.10)$ | $62.29(\pm 43.96)$ | $0.89(\pm 0.07)$ | $76.48(\pm 27.02)$ | $0.93(\pm 0.04)$ | $20.46(\pm 8.47)$ |
| KR + Hybrid | $\mathbf{0.93(\pm 0.07)}$ | $5.34(\pm 1.63)$ | $0.92(\pm 0.08)$ | $4.45(\pm 1.05)$ | $\mathbf{0.96(\pm 0.04)}$ | $0.93(\pm 0.36)$ |
Table 1: Coverage rate and width of calibrated “prediction interval” on the test set. Results are repeated for 10 runs. The reported numbers are the empirical averages of these 10 repetitions, with the values in brackets representing the empirical standard deviations.

From Table 1, we see that KR + Hybrid has the highest CCR in two out of the three test cases. GP and NN + MD each have the highest CCR in one test case. RP achieves lower CCRs and larger calibrated widths, possibly because RP is designed to flag large uncertainty in unexplored areas, whereas without bootstrapping on the training set its uncertainty quantification in explored areas can be less precise. To be concrete, Fig. 2 shows some example results on the three benchmark problems.

6.2 Synthetic Black-Box Function Optimization

We test our considered methods on the minimization of 4 well-known benchmark functions, including a 2D Goldstein-Price function, a 2D Drop-wave function, a 6D Hartmann function, and a 10D Ackley function. For all the methods, we run 100 iterations with 5 initial samples for the two 2D functions, and 500 iterations with 10 initial samples for the other two with higher dimensions. From Fig. 3, we see that PseudoBO-KR-Hyb achieves the lowest objective in three out of four tasks within the query budget. TuRBO demonstrates competitive performance by reaching the lowest objective in two out of four tasks. Moreover, we record the cumulative regret in Fig. 4. PseudoBO-KR-Hyb appears to have sublinear cumulative regret against the iteration number in all cases and is the best in three out of four tasks.
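The two quantities plotted for these minimization tasks can be computed from a sequence of queried objective values as in the sketch below. It assumes the standard definition of cumulative regret as the running sum of gaps to the known global minimum `f_star`; the function names and the toy values are hypothetical.

```python
import numpy as np

def best_so_far(values):
    """Best (smallest) objective value queried up to each iteration."""
    return np.minimum.accumulate(np.asarray(values, dtype=float))

def cumulative_regret(values, f_star):
    """Running sum of the gaps between queried values and the global minimum."""
    return np.cumsum(np.asarray(values, dtype=float) - f_star)

# Toy query history for a function whose global minimum is 0.5.
history = [3.0, 1.0, 2.0, 0.5]
best_curve = best_so_far(history)
regret_curve = cumulative_regret(history, f_star=0.5)
average_regret = regret_curve / np.arange(1, len(history) + 1)
```

Sublinear cumulative regret corresponds to the average regret per iteration tending to zero, which is what a flattening cumulative regret curve indicates.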

Fig. 3: Best objective queried against number of iterations for the synthetic black-box function minimization tasks. Each curve is an average over 10 runs.
Fig. 4: Cumulative regret in the synthetic black-box function minimization tasks. Each curve is an average over 10 runs.

6.3 Hyperparameter Tuning

We test the methods on a more realistic task of neural network tuning [34]. Our main goal here is to find the optimal set of hyperparameters for a two-layered fully connected neural network, to achieve the best training outcomes on four UCI datasets [18]. There are 9 hyperparameters, including initial learning rate, learning rate schedule, batch size, and the dropout rate, number of units, and activation function type for each of the two layers. This results in an expansive search space comprising 62,208 possible configurations in total. Fig. 5 reveals that among all the methods, PseudoBO-KR-Hyb is the fastest to uncover the optimal configuration in three out of four tasks. PseudoBO-RP does not perform well here, possibly because the randomized prior does not produce a good fit for the tree-structured function in this task. Relatively competitive models in this task are LFBO, BORE, and SMAC: LFBO achieves the best result in the Naval task and is quite close to PseudoBO-KR-Hyb in the Parkinsons task. BORE and SMAC are able to locate the optimal configurations in three out of four tasks. We also record the cumulative regret for all methods in Fig. 6. The plot shows that PseudoBO-KR-Hyb has sublinear cumulative regret in the iteration number and has the lowest cumulative regret in three out of four tasks. In the second task, BO has the lowest cumulative regret. Although BO finds the best configuration in only one dataset, its cumulative regrets in these tasks appear competitive.

Fig. 5: Instant regret against number of iterations for the neural network tuning task. At each iteration, the instant regret is defined as the current best validation loss minus the validation loss of the optimal structure. Each curve is an average over 25 runs.
Fig. 6: Cumulative regret against number of iterations in the neural network tuning task. Each curve is an average over 25 runs.
Fig. 7: Robot pushing task (first two) and Rover Trajectory task (last two). At each iteration, the best reward so far is recorded. Each curve is an average over 10 runs. The shaded area is formed by $\pm$ one standard deviation. The x-axes denote the number of iterations.

6.4 Robot Arm Pushing

This problem aims to control two robot arms with 14 parameters to push two objects to target positions. The reward is evaluated by the ending positions of the two objects pushed by the two robot arms. See [70] for more details. In this task, we follow the procedure and setup in [20], conducting a batch of 50 queries in each iteration and performing optimization with a budget of 10,000 queries. All methods are initialized with 100 queries. As demonstrated in Fig. 7 (1), all PseudoBO variants, particularly PseudoBO-KR-Hyb, converge significantly faster than other methods iteration-wise and eventually achieve a superior reward, except for TuRBO. When measured in runtime, Fig. 7 (2) illustrates that PseudoBO variants take merely 20s to achieve a reward of around 9, while most of the other methods remain at around 7 and TPE is at around 8.5. It is also noteworthy that BO stops at the 9,250-th query because the GPU runs out of memory. Another point to note is that Hyperopt (the package implementing TPE) does not support batch evaluation. Consequently, TPE queries and proposes the next point sequentially, allowing it to utilize more information than other methods. Its performance might therefore decline if it were run with batch evaluation.

6.5 Rover Trajectory Planning

An additional robotic task we test is rover trajectory optimization, which determines the locations of 30 points in a 2D plane, where the final reward is estimated by the ending position of the rover and the cost incurred by collision. See [70] for more details. We follow the procedure and setup in [20] to perform a batch of queries of size 100 in each iteration and execute optimization within a 20,000-query budget. All methods are started with 200 queries. Due to the memory issue, BO is difficult to run and is not tested here. TPE is also excluded from the test, as sequentially querying and proposing the next evaluations took an excessively long time to finish 20,000 queries. Fig. 7 (3) shows that in terms of iterations, TuRBO converges slightly faster at the outset, but is later caught up by PseudoBO-KR-Hyb-TR. Besides, all PseudoBO variants converge more rapidly than other methods and ultimately secure considerably better final rewards. In terms of runtime, Fig. 7 (4) shows that TuRBO converges as fast as PseudoBO-KR-Hyb-TR initially, but is surpassed by PseudoBO-KR-Hyb-TR at 125s, and later by PseudoBO-RP and PseudoBO-KR-Hyb at 400s. Eventually, PseudoBO-RP and PseudoBO-KR-Hyb yield rewards of around 3.8 and PseudoBO-KR-Hyb-TR around 4.2, while TuRBO only reaches 3.5.

6.6 Runtime Record

We close our discussion by comparing the runtimes of the considered methods. Table 2 enumerates the runtimes of all tested methods to complete the target number of queries in various tasks. Some entries are vacant because the corresponding methods were not included in our experiments due to memory or excessive runtime issues. For smaller-scale experiments, we exclude PseudoBO-KR-Hyb-TR as scalability is not a concern; in the larger-scale tasks, BO is eliminated due to memory exhaustion, while SMAC and TPE are not included due to their packages’ inability to support batch selection of candidates.

In small-sized tasks, our record reveals that RS and TPE demonstrate shorter runtimes than others. Immediately following these are BORE, LFBO, PseudoBO-RP, and PseudoBO-KR-Hyb, which all exhibit comparable runtimes. Although not as fast as RS and TPE, their performances are usually more promising (especially in the neural network tuning task). While TuRBO is generally faster than BO, its runtime can be inconsistent. In the task of 10D synthetic function optimization, it considerably outpaces most of the other methods, but this is not the case in the 12D neural network tuning task. This variability might arise from the restarting mechanism in the trust region method. Finally, both BO and SMAC consistently take the longest to complete the designated number of queries.

As the query budget grows up to tens of thousands in the robotic tasks, RS and TPE still stand in the fastest tier in terms of runtime, immediately followed by the PseudoBO family. Notably, the scalability of the latter becomes even more apparent in these tasks, in the sense that the time gap with TPE narrows and the PseudoBO family significantly outperforms in terms of attained objective value. Compared with other methods, in the 14D robotic task, the fastest PseudoBO model (PseudoBO-KR-Hyb-TR) is 5X faster than BORE, 5X than LFBO, 4X than TuRBO and 60X than BO. Likewise, in the 60D robotic task, PseudoBO-KR-Hyb-TR is 2.5X faster than BORE, 4X than LFBO and 5X than TuRBO. Fig. 8 shows a Pareto plot to illustrate the balance between the best objective value attained eventually and runtime.
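The quoted multiples appear to be rounded ratios of the runtimes reported in Table 2. The short sketch below recomputes them from the table entries; the dictionary layout and the function name are illustrative choices. Note that the BO and TuRBO entries are GPU runtimes while the others are CPU runtimes.

```python
# Runtimes in seconds copied from Table 2 (BO and TuRBO were run on a GPU).
runtimes = {
    "Robot Push (14D)": {
        "PseudoBO-KR-Hyb-TR": 107.22,
        "BORE": 514.37,
        "LFBO": 575.38,
        "TuRBO": 496.09,
        "BO": 6202.55,
    },
    "Rover (60D)": {
        "PseudoBO-KR-Hyb-TR": 510.60,
        "BORE": 1361.19,
        "LFBO": 2130.78,
        "TuRBO": 2572.82,
    },
}

def speedups(task, reference="PseudoBO-KR-Hyb-TR"):
    """Runtime of each method divided by the runtime of the reference method."""
    ref = runtimes[task][reference]
    return {name: t / ref for name, t in runtimes[task].items() if name != reference}

push = speedups("Robot Push (14D)")
rover = speedups("Rover (60D)")
for task, ratios in (("Robot Push (14D)", push), ("Rover (60D)", rover)):
    for name, ratio in ratios.items():
        print(f"{task}: {name} takes {ratio:.2f}x the reference runtime")
```

The computed ratios are roughly 4.8, 5.4, 4.6 and 58 for the 14D task and roughly 2.7, 4.2 and 5.0 for the 60D task, which is broadly consistent with the rounded figures in the text.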

| Tasks | Synthetic (10D) | NN Tuning (12D) | Robot Push (14D) | Rover (60D) |
|---|---|---|---|---|
| Queries | 500 | 300 | 10,000 | 20,000 |
| RS | 1.00 | 2.00 | 5.37 | 16.73 |
| BO | $399.90^{*}$ | $791.40^{*}$ | $6202.55^{*}$ | - |
| SMAC | 747.4 | 200.14 | - | - |
| TPE | 16.50 | 6.00 | 67.78 | - |
| BORE | 37.10 | 17.20 | 514.37 | 1361.19 |
| LFBO | 66.70 | 23.88 | 575.38 | 2130.78 |
| TuRBO | $20.5^{*}$ | $56.08^{*}$ | $496.09^{*}$ | $2572.82^{*}$ |
| PseudoBO-RP | 42.03 | 32.90 | 127.55 | 513.08 |
| PseudoBO-KR-Hyb | 72.01 | 20.60 | 343.93 | 1453.00 |
| PseudoBO-KR-Hyb-TR | - | - | 107.22 | 510.60 |
Table 2: Runtime records in all types of tasks. Entries are in the unit of seconds. * on the top right of the entries means that the corresponding model was run on a GPU; otherwise, only CPU was used. Without GPU’s acceleration, even TuRBO (one of the most scalable state-of-the-art BO variants) takes approximately 4 hours to complete one run for the 60D Rover task.
Fig. 8: Pareto plots of runtimes vs best objective values attained eventually in the robotic tasks (left for robot pushing and right for rover trajectory). In these visual representations, the effectiveness of a method increases alongside its position towards the upper right quadrant. The x-axis denotes the runtime with the transformation of $\log(\frac{1}{\text{runtime}})$ for the purpose of visualization.

7 Conclusion and Future Works

In this paper, we have re-evaluated the core principle of exploration-based black-box optimization. Our study is motivated by an apparent mismatch between BO theory and practice, in the sense that practical BO algorithms often lack the theoretical guarantees offered by BO theory, which largely builds on GP-based procedures. To alleviate this mismatch, we propose a PseudoBO framework, which axiomatically dissects the minimal properties needed by the key ingredients in convergent black-box optimization algorithms. In particular, we show that if SP, UQ and AF satisfy their respective properties of local consistency, SNEB and improvement, then the assembled algorithm would eventually populate the decision space and consequently converge, and these properties can be further relaxed to achieve correspondingly relaxed consistency notions. Importantly, PseudoBO provides a recipe of algorithmic convergence beyond GP. That is, while GP is in the list of admissible ingredients, it is by no means the only option. From the recipe, we locate a combination of local regression as SP, a hybrid of randomized prior and minimum distance as UQ, and EI as AF, that gives rise to a consistent algorithm that performs competitively against existing state-of-the-art benchmarks across a wide range of tasks.

We view our work as a foundation for furthering the development of exploration-based black-box optimization that is beyond GP and theory-practice-balanced. Immediate extensions include the adaptation of our framework to handle noisy evaluations. In this situation, we face both epistemic uncertainty (due to lack of knowledge on the black-box objective function) and aleatory uncertainty (due to the evaluation noise). Much like the deterministic case, there have been works on extending GP-based algorithms to handle such problems, and our PseudoBO approach that dissects the key algorithmic ingredients could be generalized to tackle the additional aleatory uncertainty. Next is grey-box optimization. This consists of objective functions that can be compositions of (multiple) black-box functions and analytically known functions. In this situation, we can individually model the SP and UQ for each underlying black-box function and suitably aggregate them via a single AF, and achieve similar algorithmic consistency as in the full black-box case. A more challenging next step is to study rate results, but this likely requires some opaque assumptions like those in GP-based methods. Finally, an important direction is to provide a more principled approach to construct UQ, which is the core ingredient in inducing exploration, that can pair with a specific SP in some arguably optimal sense, and also to study the performance of such UQ relative to that of the GP posterior.

Acknowledgements

We gratefully acknowledge support from the Amazon CAIT Fellowship, as well as the InnoHK initiative, the Government of the HKSAR, and Laboratory for AI-Powered Financial Technologies.

References

  • [1] R. Astudillo and P. Frazier. Bayesian optimization of composite functions. In International Conference on Machine Learning, pages 354–363. PMLR, 2019.
  • [2] R. Astudillo and P. Frazier. Bayesian optimization of function networks. Advances in neural information processing systems, 34:14463–14475, 2021.
  • [3] Y. Bai, H. Lam, T. Balch, and S. Vyetrenko. Efficient calibration of multi-agent simulation models from output series with bayesian optimization. In Proceedings of the Third ACM International Conference on AI in Finance, pages 437–445, 2022.
  • [4] M. Balandat, B. Karrer, D. R. Jiang, S. Daulton, B. Letham, A. G. Wilson, and E. Bakshy. BoTorch: A Framework for Efficient Monte-Carlo Bayesian Optimization. In Advances in Neural Information Processing Systems 33, 2020.
  • [5] R. R. Barton, H. Lam, and E. Song. Input uncertainty in stochastic simulation. In The Palgrave Handbook of Operations Research, pages 573–620. Springer, 2022.
  • [6] J. Bect, F. Bachoc, and D. Ginsbourger. A supermartingale approach to gaussian process based sequential design of experiments. arXiv preprint arXiv:1608.01118, 2016.
  • [7] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. Advances in neural information processing systems, 24, 2011.
  • [8] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of machine learning research, 13(2), 2012.
  • [9] S. Bochner. Harmonic analysis and the theory of probability. Courier Corporation, 2005.
  • [10] E. Brochu, V. M. Cora, and N. De Freitas. A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.
  • [11] A. D. Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12(10), 2011.
  • [12] R. Calandra, A. Seyfarth, J. Peters, and M. P. Deisenroth. Bayesian optimization for learning gaits under uncertainty: An experimental comparison on a dynamic bipedal walker. Annals of Mathematics and Artificial Intelligence, 76:5–23, 2016.
  • [13] O. Chapelle and L. Li. An empirical evaluation of thompson sampling. Advances in neural information processing systems, 24, 2011.
  • [14] C. Chevalier and D. Ginsbourger. Fast computation of the multi-points expected improvement with applications in batch selection. In Learning and Intelligent Optimization: 7th International Conference, LION 7, Catania, Italy, January 7-11, 2013, Revised Selected Papers, pages 59–69. Springer, 2013.
  • [15] S. R. Chowdhury and A. Gopalan. No-regret algorithms for multi-task bayesian optimization. In International Conference on Artificial Intelligence and Statistics, pages 1873–1881. PMLR, 2021.
  • [16] C. G. Corlu, A. Akcay, and W. Xie. Stochastic simulation under input uncertainty: A review. Operations Research Perspectives, 7:100162, 2020.
  • [17] L. Csató and M. Opper. Sparse on-line gaussian processes. Neural computation, 14(3):641–668, 2002.
  • [18] D. Dheeru and E. Karra Taniskidou. UCI machine learning repository, 2017.
  • [19] B. Efron. The jackknife, the bootstrap and other resampling plans. SIAM, 1982.
  • [20] D. Eriksson, M. Pearce, J. Gardner, R. D. Turner, and M. Poloczek. Scalable global optimization via local bayesian optimization. Advances in neural information processing systems, 32, 2019.
  • [21] P. Frazier, W. Powell, and S. Dayanik. The knowledge-gradient policy for correlated normal beliefs. INFORMS journal on Computing, 21(4):599–613, 2009.
  • [22] P. I. Frazier. A tutorial on bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.
  • [23] J. Gardner, G. Pleiss, K. Q. Weinberger, D. Bindel, and A. G. Wilson. Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration. Advances in neural information processing systems, 31, 2018.
  • [24] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256. JMLR Workshop and Conference Proceedings, 2010.
  • [25] J. González, Z. Dai, P. Hennig, and N. Lawrence. Batch bayesian optimization via local penalization. In Artificial intelligence and statistics, pages 648–657. PMLR, 2016.
  • [26] L. Györfi, M. Köhler, A. Krzyżak, and H. Walk. A distribution-free theory of nonparametric regression, volume 1. Springer, 2002.
  • [27] A. Hebbal, L. Brevault, M. Balesdent, E.-G. Talbi, and N. Melab. Bayesian optimization using deep gaussian processes with applications to aerospace system design. Optimization and Engineering, 22:321–361, 2021.
  • [28] J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian processes for big data. arXiv preprint arXiv:1309.6835, 2013.
  • [29] F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In International conference on learning and intelligent optimization, pages 507–523. Springer, 2011.
  • [30] D. R. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of global optimization, 21:345–383, 2001.
  • [31] D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global optimization, 13(4):455, 1998.
  • [32] K. Kandasamy, J. Schneider, and B. Póczos. High dimensional bayesian optimisation and bandits via additive models. In International conference on machine learning, pages 295–304. PMLR, 2015.
  • [33] E. Kaufmann, N. Korda, and R. Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In Algorithmic Learning Theory: 23rd International Conference, ALT 2012, Lyon, France, October 29-31, 2012. Proceedings 23, pages 199–213. Springer, 2012.
  • [34] A. Klein and F. Hutter. Tabular benchmarks for joint architecture and hyperparameter optimization. arXiv preprint arXiv:1905.04970, 2019.
  • [35] N. Lawrence, M. Seeger, and R. Herbrich. Fast sparse gaussian process methods: The informative vector machine. Advances in neural information processing systems, 15, 2002.
  • [36] M. Lázaro-Gredilla, J. Quinonero-Candela, C. E. Rasmussen, and A. R. Figueiras-Vidal. Sparse spectrum gaussian process regression. The Journal of Machine Learning Research, 11:1865–1881, 2010.
  • [37] J. Li and I. O. Ryzhov. Convergence rates of epsilon-greedy global optimization under radial basis function interpolation. Stochastic Systems, 13(1):59–92, 2023.
  • [38] M. Lindauer, K. Eggensperger, M. Feurer, A. Biedenkapp, D. Deng, C. Benjamins, T. Ruhkopf, R. Sass, and F. Hutter. Smac3: A versatile bayesian optimization package for hyperparameter optimization. Journal of Machine Learning Research, 23(54):1–9, 2022.
  • [39] C. Liu and Y.-X. Wang. Global optimization with parametric function approximation. In International Conference on Machine Learning, pages 22113–22136. PMLR, 2023.
  • [40] M. Locatelli. Bayesian algorithms for one-dimensional global optimization. Journal of Global Optimization, 10(1):57, 1997.
  • [41] R. Marchant and F. Ramos. Bayesian optimisation for intelligent environmental monitoring. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pages 2242–2249. IEEE, 2012.
  • [42] R. Martinez-Cantin, N. de Freitas, A. Doucet, and J. A. Castellanos. Active policy learning for robot planning and exploration under uncertainty. In Robotics: Science and systems, volume 3, pages 321–328, 2007.
  • [43] M. McIntire, D. Ratner, and S. Ermon. Sparse gaussian processes for bayesian optimization. In UAI, 2016.
  • [44] S. Müller, M. Feurer, N. Hollmann, and F. Hutter. Pfns4bo: In-context learning for bayesian optimization. In International Conference on Machine Learning, pages 25444–25470. PMLR, 2023.
  • [45] S. Müller, N. Hollmann, S. P. Arango, J. Grabocka, and F. Hutter. Transformers can do bayesian inference. arXiv preprint arXiv:2112.10510, 2021.
  • [46] A. Nayebi, A. Munteanu, and M. Poloczek. A framework for bayesian optimization in embedded subspaces. In International Conference on Machine Learning, pages 4752–4761. PMLR, 2019.
  • [47] C. Oh, E. Gavves, and M. Welling. Bock: Bayesian optimization with cylindrical kernels. In International Conference on Machine Learning, pages 3868–3877. PMLR, 2018.
  • [48] I. Osband, J. Aslanides, and A. Cassirer. Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems, 31, 2018.
  • [49] M. A. Osborne, R. Garnett, and S. J. Roberts. Gaussian processes for global optimization. In 3rd international conference on learning and intelligent optimization (LION3), pages 1–15. Springer-Verlag Berlin, Germany, 2009.
  • [50] M. A. L. Pearce, M. Poloczek, and J. Branke. Bayesian optimization allowing for common random numbers. Operations Research, 70(6):3457–3472, 2022.
  • [51] V. Perrone, R. Jenatton, M. W. Seeger, and C. Archambeau. Scalable hyperparameter transfer learning. Advances in neural information processing systems, 31, 2018.
  • [52] H. Rakotoarison, S. Adriaensen, N. Mallik, S. Garibov, E. Bergman, and F. Hutter. In-context freeze-thaw bayesian optimization for hyperparameter optimization. arXiv preprint arXiv:2404.16795, 2024.
  • [53] D. Russo and B. Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
  • [54] J. Sacks, W. J. Welch, T. J. Mitchell, and H. P. Wynn. Design and analysis of computer experiments. Statistical Science, 4(4):409–423, 1989.
  • [55] M. W. Seeger, C. K. Williams, and N. D. Lawrence. Fast forward selection to speed up sparse gaussian process regression. In International Workshop on Artificial Intelligence and Statistics, pages 254–261. PMLR, 2003.
  • [56] D. Sha, K. Ozbay, and Y. Ding. Applying bayesian optimization for calibration of transportation simulation models. Transportation Research Record, 2674(10):215–228, 2020.
  • [57] A. Shah and Z. Ghahramani. Parallel predictive entropy search for batch global optimization of expensive objective functions. Advances in neural information processing systems, 28, 2015.
  • [58] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas. Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2015.
  • [59] E. Snelson and Z. Ghahramani. Sparse gaussian processes using pseudo-inputs. Advances in neural information processing systems, 18, 2005.
  • [60] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Prabhat, and R. Adams. Scalable bayesian optimization using deep neural networks. In International conference on machine learning, pages 2171–2180. PMLR, 2015.
  • [61] J. Song, L. Yu, W. Neiswanger, and S. Ermon. A general recipe for likelihood-free bayesian optimization. In International Conference on Machine Learning, pages 20384–20404. PMLR, 2022.
  • [62] J. T. Springenberg, A. Klein, S. Falkner, and F. Hutter. Bayesian optimization with robust bayesian neural networks. Advances in neural information processing systems, 29, 2016.
  • [63] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
  • [64] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.
  • [65] L. C. Tiao, A. Klein, M. W. Seeger, E. V. Bonilla, C. Archambeau, and F. Ramos. Bore: Bayesian optimization by density-ratio estimation. In International Conference on Machine Learning, pages 10289–10300. PMLR, 2021.
  • [66] S. Toscano-Palmerin and P. I. Frazier. Bayesian optimization with expensive integrands. arXiv preprint arXiv:1803.08661, 2018.
  • [67] J. Ungredda, M. Pearce, and J. Branke. Bayesian optimisation vs. input uncertainty reduction. ACM Transactions on Modeling and Computer Simulation (TOMACS), 32(3):1–26, 2022.
  • [68] E. Vazquez and J. Bect. Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. Journal of Statistical Planning and inference, 140(11):3088–3095, 2010.
  • [69] J. Wang, S. C. Clark, E. Liu, and P. I. Frazier. Parallel bayesian global optimization of expensive functions. Operations Research, 68(6):1850–1865, 2020.
  • [70] Z. Wang, C. Gehring, P. Kohli, and S. Jegelka. Batched large-scale bayesian optimization in high-dimensional spaces. In International Conference on Artificial Intelligence and Statistics, pages 745–754. PMLR, 2018.
  • [71] C. White, W. Neiswanger, and Y. Savani. Bananas: Bayesian optimization with neural architectures for neural architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 10293–10301, 2021.
  • [72] J. Wilson, V. Borovitskiy, A. Terenin, P. Mostowsky, and M. Deisenroth. Efficiently sampling functions from gaussian process posteriors. In International Conference on Machine Learning, pages 10292–10302. PMLR, 2020.
  • [73] J. Wu and P. Frazier. The parallel knowledge gradient method for batch bayesian optimization. Advances in neural information processing systems, 29, 2016.
  • [74] J. Xie, P. I. Frazier, and S. E. Chick. Bayesian optimization via simulation with pairwise sampling and correlated prior beliefs. Operations Research, 64(2):542–559, 2016.

Appendices

8 Proofs

Proof of Theorem 3.2.

We first prove part 1 of the theorem. Denote by $\{x_n\}$ the sequence of evaluation points chosen by PseudoBO, and correspondingly let $\mathcal{D}_n=\{(x_1,f(x_1)),\ldots,(x_n,f(x_n))\}$ be the data collected up to step $n$. Since $\mathcal{X}$ is compact, $\{x_n\}$ has a convergent subsequence, say $\{x_{\pi(n)}\}$. By Assumption 3.1 part 2, we have $W_n(x_{\pi(n)};\mathcal{D}_{\pi(n)-1})\to 0$ and hence $\liminf_{n\to\infty}W_n(x_n;\mathcal{D}_{n-1})=0$.
On the other hand, suppose there is a point, say $x'\in\mathcal{X}$, that is non-adherent to $\{x_n\}$, i.e., there is no subsequence of $\{x_n\}$ that converges to $x'$. Then, by Assumption 3.1 part 1, we have $\liminf_{n\to\infty}W_n(x';\mathcal{D}_n)>0$. But this is impossible because PseudoBO requires at each step $x_{n+1}\in\operatorname*{argmax}_{x\in\mathcal{X}}W_n(x;\mathcal{D}_n)$, so the acquisition value at the selected point is at least that at $x'$ and cannot have zero limit inferior, hence a contradiction.

Part 2 of the theorem follows by a straightforward use of the continuity of $f$. More precisely, as $\mathcal{X}$ is compact and $f$ is continuous, $x^*\in\operatorname*{argmax}_{x\in\mathcal{X}}f(x)$ is well-defined and there is a subsequence of $\{x_n\}$, say $\{x_{\pi'(n)}\}$, converging to $x^*$ by our first conclusion above. Then we have $f(x_{\pi'(n)})\to f(x^*)=Z^*$ by the continuity of $f$. This gives

$$Z^*\geq f(\hat{x}_n^*)\geq f(x_{\pi'(n)})\to f(x^*)=Z^*,$$

which concludes $f(\hat{x}_n^*)\to Z^*$. ∎
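
As an informal numerical illustration of the density argument in this proof, the following Python sketch runs a toy acquisition-maximization loop on a grid over $[0,1]$. Everything in it is an assumption made for illustration and is not taken from the paper: the objective `f`, the nearest-neighbor surrogate, the distance-based uncertainty, and the acquisition `dist / (1 + best - value)`, which was chosen only because it vanishes at evaluated points and stays bounded away from zero at points far from all evaluations.

```python
import math


def f(x):
    # Hypothetical multimodal objective on [0, 1]; |f'| <= 6 + 7.5 = 13.5.
    return math.sin(6.0 * x) + 0.5 * math.cos(15.0 * x)


def nearest(x, data):
    # (distance, value) of the evaluated point closest to x.
    return min((abs(x - xi), yi) for xi, yi in data)


def acquisition(x, data, best):
    dist, value = nearest(x, data)
    # dist plays the role of the uncertainty: zero at evaluated points.
    # The multiplier lies in (0, 1] and favors regions with high values.
    return dist / (1.0 + best - value)


def pseudo_bo(n_steps, grid_size=201):
    # Maximize the acquisition over a fixed grid at every step.
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    data = [(0.5, f(0.5))]
    for _ in range(n_steps):
        best = max(y for _, y in data)
        x_next = max(grid, key=lambda x: acquisition(x, data, best))
        data.append((x_next, f(x_next)))
    return grid, data


def fill_distance(grid, data):
    # Largest distance from a grid point to its nearest evaluated point.
    return max(nearest(x, data)[0] for x in grid)


if __name__ == "__main__":
    for n in (10, 40, 80):
        grid, data = pseudo_bo(n)
        print(n, fill_distance(grid, data), max(y for _, y in data))
```

For this toy objective the multiplier lies in $[1/4, 1]$ (the range of `f` is at most 3), so each selected point is at least a quarter of the current fill distance away from all earlier points. The fill distance after $n$ steps is therefore at most $4/n$ on this grid, which mirrors the conclusion that no point can remain non-adherent.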

Proof of Theorem 3.6.

Consider an arbitrary finite-cardinality set sequence $S_n\subset\mathcal{X}$ and $A_n=E_f(S_n)$. Suppose $\inf_n\Delta(x,S_n)>0$. Then, by Assumption 3.4 part 1, we have $\liminf_{n\to\infty}\hat{\sigma}(x;A_n)>0$. By Assumption 3.5 part 1, we must have $\liminf_{n\to\infty}W_n(x,A_n)>0$. This verifies Assumption 3.1 part 1.

To verify Assumption 3.1 part 2, consider a sequence $x_n\in\mathcal{X}$ that converges to $x'$, and $A_n=E_f(S_n)$ for an arbitrary finite-cardinality set sequence $S_n\subset\mathcal{X}$. By Assumption 3.4 part 2, we have $\hat{\sigma}(x_n;A_{n-1}\cup\mathcal{D}_{n-1})\to 0$ where $\mathcal{D}_{n-1}=\{(x_1,f(x_1)),\ldots,(x_{n-1},f(x_{n-1}))\}$. Moreover, by Assumption 3.3 we have $\hat{f}(x_n;A_{n-1}\cup\mathcal{D}_{n-1})\to f(x')$. Thus,

$$\zeta\big(\hat{f}(x_n;A_{n-1}\cup\mathcal{D}_{n-1})-\max\Pi_f(A_{n-1}\cup\mathcal{D}_{n-1})\big)\leq\zeta\big(\hat{f}(x_n;A_{n-1}\cup\mathcal{D}_{n-1})-f(x_{n-1})\big)\to\zeta(0)\leq 0,$$

by the assumed continuity of $f$ and $\zeta$, the non-decreasing property of $\zeta$ and $\zeta(0)\leq 0$. Thus, by Assumption 3.5 part 2, we further have

$$W_n(x_n;A_{n-1}\cup\mathcal{D}_{n-1})=g_n\big(\zeta\big(\hat{f}(x_n;A_{n-1}\cup\mathcal{D}_{n-1})-\max\Pi_f(A_{n-1}\cup\mathcal{D}_{n-1})\big),\hat{\sigma}(x_n;A_{n-1}\cup\mathcal{D}_{n-1})\big)\to 0.$$

This concludes Assumption 3.1 part 2. ∎
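
To make the last step of this argument concrete, the sketch below evaluates one possible aggregation function $g$ as its second argument (the uncertainty) shrinks to zero while its first argument stays non-positive. The expected-improvement-type form of `g` and the identity choice of $\zeta$ are stand-ins assumed here for illustration; whether they match Assumption 3.5 exactly is not checked against the paper.

```python
import math


def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(z):
    # erfc avoids cancellation in the lower tail.
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def zeta(z):
    # Hypothetical choice: continuous, non-decreasing, zeta(0) = 0 <= 0.
    return z


def g(a, s):
    # EI-type aggregation of a = zeta(f_hat - best) and s = sigma_hat >= 0.
    if s <= 0.0:
        return max(a, 0.0)
    z = a / s
    return a * std_normal_cdf(z) + s * std_normal_pdf(z)


if __name__ == "__main__":
    # Non-positive first argument, vanishing uncertainty: g tends to zero.
    for s in (1.0, 0.1, 0.01, 0.001):
        print(s, g(zeta(0.0), s), g(zeta(-0.1), s))
```

With this choice, `g` is strictly positive whenever the uncertainty is positive and tends to zero as the uncertainty vanishes with a non-positive first argument, which is the pair of behaviors the two parts of the verification rely on.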

Proof of Theorem 3.9.

We first prove part 1 of the theorem. Denote by $\{x_n\}$ the sequence of evaluation points chosen by PseudoBO, and correspondingly let $\mathcal{D}_n=\{(x_1,f(x_1)),\ldots,(x_n,f(x_n))\}$ be the data collected up to step $n$. Since $\mathcal{X}$ is compact, $\{x_n\}$ has a convergent subsequence, say $\{x_{\pi(n)}\}$. By Assumption 3.8 part 2, we have $W_n(x_{\pi(n)};\mathcal{D}_{\pi(n)-1})\to 0$ and hence $\liminf_{n\to\infty}W_n(x_n;\mathcal{D}_{n-1})=0$.
On the other hand, suppose there is a point, say $x'\in\mathcal{X}$, that satisfies $\inf_n\Delta(x',X_n)>\delta$. Then, by Assumption 3.8 part 1, we have $\liminf_{n\to\infty}W_n(x';\mathcal{D}_n)>0$. But this is impossible because PseudoBO requires at each step $x_{n+1}\in\operatorname*{argmax}_{x\in\mathcal{X}}W_n(x;\mathcal{D}_n)$, hence a contradiction.

Now we prove part 2 of the theorem. First, as $\mathcal{X}$ is compact and $f$ is continuous, $x^*\in\operatorname*{argmax}_{x\in\mathcal{X}}f(x)$ is well-defined. By our first conclusion above, for any $\eta>0$, we can find an $x_n$ such that $\|x_n-x^*\|\leq\delta+\eta$. Then we have

$$f(\hat{x}_n^*)\geq f(x_n)\geq f(x^*)-L(\delta+\eta),$$

by the Lipschitzness of $f$. Since $\eta$ is arbitrary, we have

$$\liminf_n f(\hat{x}_n^*)\geq f(x^*)-L\delta=Z^*-L\delta$$

which concludes the theorem. ∎
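
The bound $Z^*-L\delta$ can be checked numerically on a simple case. The sketch below does not run PseudoBO; it hard-codes a set of evaluation points that comes within $\delta$ of every point of $[0,1]$ and compares the best evaluated value with $Z^*-L\delta$. The objective $\sin(3x)$, its Lipschitz constant $L=3$, and the helper names are assumptions chosen for illustration only.

```python
import math

L = 3.0       # Lipschitz constant of f below, since |f'(x)| = |3 cos(3x)| <= 3
Z_STAR = 1.0  # maximum of f on [0, 1], attained at x = pi / 6


def f(x):
    # Hypothetical Lipschitz objective on [0, 1].
    return math.sin(3.0 * x)


def delta_cover(delta):
    # Points such that every x in [0, 1] is within delta of at least one.
    m = math.ceil(1.0 / (2.0 * delta))
    return [min(1.0, (2 * k + 1) * delta) for k in range(m)]


def best_value(delta):
    # Best objective value among the evaluation points of the cover.
    return max(f(x) for x in delta_cover(delta))


if __name__ == "__main__":
    for delta in (0.3, 0.2, 0.1, 0.05, 0.01):
        print(delta, best_value(delta), Z_STAR - L * delta)
```

The guarantee is a worst-case one: the best evaluated value may be much closer to $Z^*$ than $L\delta$, but it can never fall below $Z^*-L\delta$ once the evaluation points are within $\delta$ of a maximizer.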

Proof of Theorem 3.12.

Consider an arbitrary finite-cardinality set sequence $S_n\subset\mathcal{X}$, and $A_n=E_f(S_n)$. Suppose $\inf_n\Delta(x,S_n)>\delta$. Then, by Assumption 3.11 part 1, we have $\liminf_{n\to\infty}\hat{\sigma}(x;A_n)>0$. By Assumption 3.5 part 1, we must have $\liminf_{n\to\infty}W_n(x,A_n)>0$. This verifies Assumption 3.8 part 1.

To verify Assumption 3.8 part 2, consider a sequence xnXsubscript𝑥𝑛𝑋x_{n}\in Xitalic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ italic_X that converges to xsuperscript𝑥x^{\prime}italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT, and An=Ef(Sn)subscript𝐴𝑛subscript𝐸𝑓subscript𝑆𝑛A_{n}=E_{f}(S_{n})italic_A start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_E start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ( italic_S start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) for an arbitrary finite-cardinality set sequence SnXsubscript𝑆𝑛𝑋S_{n}\subset Xitalic_S start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ⊂ italic_X. By Assumption 3.11 part 2, we have σ^(xn;An1𝒟n1)0^𝜎subscript𝑥𝑛subscript𝐴𝑛1subscript𝒟𝑛10\hat{\sigma}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})\to 0over^ start_ARG italic_σ end_ARG ( italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ; italic_A start_POSTSUBSCRIPT italic_n - 1 end_POSTSUBSCRIPT ∪ caligraphic_D start_POSTSUBSCRIPT italic_n - 1 end_POSTSUBSCRIPT ) → 0 where 𝒟n1={(x1,f(x1)),,(xn1,f(xn1))}subscript𝒟𝑛1subscript𝑥1𝑓subscript𝑥1subscript𝑥𝑛1𝑓subscript𝑥𝑛1\mathcal{D}_{n-1}=\{(x_{1},f(x_{1})),\ldots,(x_{n-1},f(x_{n-1}))\}caligraphic_D start_POSTSUBSCRIPT italic_n - 1 end_POSTSUBSCRIPT = { ( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_f ( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ) , … , ( italic_x start_POSTSUBSCRIPT italic_n - 1 end_POSTSUBSCRIPT , italic_f ( italic_x start_POSTSUBSCRIPT italic_n - 1 end_POSTSUBSCRIPT ) ) }. 
Moreover, by Assumption 3.10 we have lim supnf^(xn;An1𝒟n1)f(x)+ϵsubscriptlimit-supremum𝑛^𝑓subscript𝑥𝑛subscript𝐴𝑛1subscript𝒟𝑛1𝑓superscript𝑥italic-ϵ\limsup_{n}\hat{f}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})\leq f(x^{\prime})+\epsilonlim sup start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT over^ start_ARG italic_f end_ARG ( italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ; italic_A start_POSTSUBSCRIPT italic_n - 1 end_POSTSUBSCRIPT ∪ caligraphic_D start_POSTSUBSCRIPT italic_n - 1 end_POSTSUBSCRIPT ) ≤ italic_f ( italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) + italic_ϵ. Thus,

ζ(f^(xn;An1𝒟n1)maxΠf(An1𝒟n1)ϵ)ζ(f^(xn;An1𝒟n1)f(xn1)ϵ)𝜁^𝑓subscript𝑥𝑛subscript𝐴𝑛1subscript𝒟𝑛1subscriptΠ𝑓subscript𝐴𝑛1subscript𝒟𝑛1italic-ϵ𝜁^𝑓subscript𝑥𝑛subscript𝐴𝑛1subscript𝒟𝑛1𝑓subscript𝑥𝑛1italic-ϵ\zeta(\hat{f}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})-\max\Pi_{f}(A_{n-1}\cup% \mathcal{D}_{n-1})-\epsilon)\leq\zeta(\hat{f}(x_{n};A_{n-1}\cup\mathcal{D}_{n-% 1})-f(x_{n-1})-\epsilon)italic_ζ ( over^ start_ARG italic_f end_ARG ( italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ; italic_A start_POSTSUBSCRIPT italic_n - 1 end_POSTSUBSCRIPT ∪ caligraphic_D start_POSTSUBSCRIPT italic_n - 1 end_POSTSUBSCRIPT ) - roman_max roman_Π start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ( italic_A start_POSTSUBSCRIPT italic_n - 1 end_POSTSUBSCRIPT ∪ caligraphic_D start_POSTSUBSCRIPT italic_n - 1 end_POSTSUBSCRIPT ) - italic_ϵ ) ≤ italic_ζ ( over^ start_ARG italic_f end_ARG ( italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ; italic_A start_POSTSUBSCRIPT italic_n - 1 end_POSTSUBSCRIPT ∪ caligraphic_D start_POSTSUBSCRIPT italic_n - 1 end_POSTSUBSCRIPT ) - italic_f ( italic_x start_POSTSUBSCRIPT italic_n - 1 end_POSTSUBSCRIPT ) - italic_ϵ )

and so

$$
\begin{aligned}
& \limsup_n \zeta(\hat{f}(x_n; A_{n-1} \cup \mathcal{D}_{n-1}) - \max \Pi_f(A_{n-1} \cup \mathcal{D}_{n-1}) - \epsilon) \\
&\leq \zeta(\limsup_n \{\hat{f}(x_n; A_{n-1} \cup \mathcal{D}_{n-1}) - f(x_{n-1}) - \epsilon\}) \\
&\leq \zeta(f(x') + \epsilon - f(x^*) - \epsilon) \\
&\leq \zeta(0) \\
&\leq 0
\end{aligned}
$$

by the assumed continuity of $f$ and $\zeta$, the non-decreasing property of $\zeta$, and $\zeta(0) \leq 0$. Thus, by Assumption 3.5 part 2, we further have

$$W_n(x_n; A_{n-1} \cup \mathcal{D}_{n-1}) = g_n(\zeta(\hat{f}(x_n; A_{n-1} \cup \mathcal{D}_{n-1}) - \max \Pi_f(A_{n-1} \cup \mathcal{D}_{n-1}) - \epsilon), \hat{\sigma}(x_n; A_{n-1} \cup \mathcal{D}_{n-1})) \to 0.$$

This concludes Assumption 3.8 part 2. ∎

Proof of Proposition 4.1.

Proposition 10 in [68] assumes that $f \in \mathcal{H}$, where $\mathcal{H}$ denotes the RKHS of $\mathcal{K}$. Consider two sequences $\{x_n\}, \{y_n\} \subset \mathcal{X}$, where $\mathcal{X}$ is compact. Suppose $y_n \rightarrow y'$ such that $y'$ is adherent to $\{x_n\}$. This is condition (i) in Proposition 10 of [68]. A partial result of Proposition 10 shows that, under this condition, the GP mean predictor satisfies $\hat{f}_{GP}(y_n; \mathcal{D}_n) \rightarrow f(y')$.

To further show that it satisfies our local consistency assumption, consider a sequence $x_n \in \mathcal{X}$ that converges to $x'$, and $A_n = E_f(S_n)$ for an arbitrary finite-cardinality set sequence $S_n \subset \mathcal{X}$. By definition, $x'$ is an adherent point of $S_n \cup X_n$. Therefore, we have $\hat{f}(x_n; A_{n-1} \cup \mathcal{D}_{n-1}) \to f(x')$. ∎

Proof of Proposition 4.2.

Suppose $\{x_n\}$ is a sequence in $\mathcal{X}$ that converges to $x'$, and $A_n = E_f(S_n)$ for an arbitrary finite-cardinality set sequence $S_n \subset \mathcal{X}$. Define $x'_n := \operatorname*{arg\,min}_{y \in X_{n-1} \cup S_{n-1}} \|x_n - y\|$ where $X_{n-1} = \{x_1, \ldots, x_{n-1}\}$.
We have $\hat{f}_{\text{NN}}(x_n; A_{n-1} \cup \mathcal{D}_{n-1}) = f(x'_n)$. Note that $x'_n \to x'$: since $x_{n-1} \in X_{n-1}$, we have $\|x_n - x'_n\| \leq \|x_n - x_{n-1}\| \to 0$, and $x_n \to x'$. Thus, $\hat{f}_{\text{NN}}(x_n; A_{n-1} \cup \mathcal{D}_{n-1}) = f(x'_n) \to f(x')$ by the continuity of $f$. ∎
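As a quick numerical illustration of this argument, the following minimal sketch evaluates a 1-nearest-neighbor predictor along a sequence $x_n \to x'$ and tracks $|\hat{f}_{\text{NN}}(x_n; \mathcal{D}_{n-1}) - f(x')|$. The objective `f`, the sequence $x_n = x' + 0.5/n$, and the choice $S_n = \emptyset$ are assumptions made only for this example and are not taken from the paper.

```python
import math

def f(x):
    # Hypothetical continuous objective on [0, 1]; chosen only for illustration.
    return math.sin(3.0 * x) + 0.5 * x

def nn_predict(x, points, values):
    """1-nearest-neighbor surrogate: value observed at the closest evaluated point."""
    j = min(range(len(points)), key=lambda i: abs(points[i] - x))
    return values[j]

def nn_errors(x_prime=0.4, n_max=200):
    """Return |f_hat_NN(x_n; D_{n-1}) - f(x')| for n = 2, ..., n_max."""
    xs = [x_prime + 0.5 / n for n in range(1, n_max + 1)]  # x_n -> x'
    errs = []
    for n in range(2, n_max + 1):
        past = xs[: n - 1]  # X_{n-1} = {x_1, ..., x_{n-1}}, with S_n taken empty
        pred = nn_predict(xs[n - 1], past, [f(p) for p in past])
        errs.append(abs(pred - f(x_prime)))
    return errs

errs = nn_errors()
print(errs[0], errs[-1])  # the error shrinks as x_n approaches x'
```

Here the nearest evaluated point to $x_n$ is $x_{n-1}$, so the error reduces to $|f(x_{n-1}) - f(x')|$, which vanishes by continuity.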

Proof of Proposition 4.3.

Suppose $\{x_n\}$ is a sequence in $\mathcal{X}$ that converges to $x'$, and $A_n = E_f(S_n)$ for an arbitrary finite-cardinality set sequence $S_n \subset \mathcal{X}$. We have

$$
\begin{aligned}
& |\hat{f}_{\text{Net}}(x_n; A_{n-1} \cup \mathcal{D}_{n-1}) - \hat{f}_{\text{Net}}(x_{n-1}; A_{n-1} \cup \mathcal{D}_{n-1})| \\
&\leq L \lVert x_n - x_{n-1} \rVert && \text{since $\hat{f}_{\text{Net}}$ is $L$-Lipschitz} \\
&\to 0 && \text{by the convergence and hence Cauchy property of $\{x_n\}$}
\end{aligned}
\tag{17}
$$

Thus,

$$
\begin{aligned}
& |\hat{f}_{\text{Net}}(x_n; A_{n-1} \cup \mathcal{D}_{n-1}) - f(x')| \\
&\leq |\hat{f}_{\text{Net}}(x_n; A_{n-1} \cup \mathcal{D}_{n-1}) - \hat{f}_{\text{Net}}(x_{n-1}; A_{n-1} \cup \mathcal{D}_{n-1})| + |\hat{f}_{\text{Net}}(x_{n-1}; A_{n-1} \cup \mathcal{D}_{n-1}) - f(x')| \\
&\to 0
\end{aligned}
$$

by the triangle inequality, (17) and $\hat{f}_{\text{Net}}(x_{n-1}; A_{n-1} \cup \mathcal{D}_{n-1}) = f(x')$ thanks to the over-parametrization. ∎

Proof of Proposition 4.4.

Suppose $x_n$ is a sequence in $\mathcal{X}$ that converges to $x'$, and $A_n = E_f(S_n)$ for an arbitrary finite-cardinality set sequence $S_n \subset \mathcal{X}$. Then there exists $N > 0$ such that for all $n \geq N$, we have $\|x_n - x'\| \leq \text{diam}(R_{l(x', \theta)})$. For $n \geq N$, we have

$$
\begin{aligned}
|\hat{f}_{\text{tree}}(x_n; A_{n-1} \cup \mathcal{D}_{n-1}) - f(x')|
&= \left| \sum_{i=1}^{n-1} \frac{\mathbbm{1}\{x_i \in R_{l(x_n, \theta)}\}}{\sum_{j=1}^{n-1} \mathbbm{1}\{x_j \in R_{l(x_n, \theta)}\}} f(x_i) - f(x') \right| \\
&= \left| \frac{1}{|\mathcal{I}_{n-1}(R_{l(x_n, \theta)})|} \sum_{x_i \in \mathcal{I}_{n-1}(R_{l(x_n, \theta)})} f(x_i) - f(x') \right| \\
&\leq \frac{1}{|\mathcal{I}_{n-1}(R_{l(x_n, \theta)})|} \sum_{x_i \in \mathcal{I}_{n-1}(R_{l(x_n, \theta)})} |f(x_i) - f(x')| \\
&\leq \frac{1}{|\mathcal{I}_{n-1}(R_{l(x_n, \theta)})|} \sum_{x_i \in \mathcal{I}_{n-1}(R_{l(x_n, \theta)})} L \|x_i - x'\| \\
&\leq \frac{1}{|\mathcal{I}_{n-1}(R_{l(x_n, \theta)})|} \sum_{x_i \in \mathcal{I}_{n-1}(R_{l(x_n, \theta)})} L \cdot \left( \text{diam}(R_{l(x_n, \theta)}) + \text{diam}(R_{l(x', \theta)}) \right) \\
&= 2L \cdot \max_{x \in \mathcal{X}} \text{diam}(R_{l(x, \theta)}),
\end{aligned}
$$

where $|\mathcal{I}_{n-1}(R_{l(x_n, \theta)})|$ represents the cardinality of the set $\mathcal{I}_{n-1}(R_{l(x_n, \theta)})$ comprising all points in $A_{n-1} \cup \mathcal{D}_{n-1}$ located inside the region $R_{l(x_n, \theta)}$, the last but one inequality follows by the $L$-Lipschitz property, and the last inequality follows from the triangle inequality. ∎
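The bound above can be checked numerically with a minimal sketch in which a fixed grid of cells stands in for the tree leaves $R_{l(\cdot, \theta)}$. The objective `f`, its Lipschitz constant `L_CONST`, the grid partition in `leaf_index`, and the sequence $x_n = x' + 0.5/n$ with $S_n = \emptyset$ are all assumptions made for illustration; the paper's tree construction is not reproduced here.

```python
import math

L_CONST = 3.5  # Lipschitz constant of the toy f below, since |f'(x)| <= 3 + 0.5

def f(x):
    # Hypothetical objective, used only for illustration.
    return math.sin(3.0 * x) + 0.5 * x

def leaf_index(x, width):
    # Stand-in for l(x, theta): cells [k*width, (k+1)*width) play the role of leaves.
    return math.floor(x / width)

def partition_predict(x, points, values, width):
    """Average the observed values whose inputs share a leaf with x."""
    cell = leaf_index(x, width)
    in_cell = [v for p, v in zip(points, values) if leaf_index(p, width) == cell]
    return sum(in_cell) / len(in_cell) if in_cell else None

def worst_error(width, x_prime=0.43, n_max=300):
    """Largest |f_hat(x_n) - f(x')| over n with |x_n - x'| <= width and a non-empty leaf."""
    xs = [x_prime + 0.5 / n for n in range(1, n_max + 1)]
    worst = 0.0
    for n in range(2, n_max + 1):
        x_n, past = xs[n - 1], xs[: n - 1]
        pred = partition_predict(x_n, past, [f(p) for p in past], width)
        if pred is not None and abs(x_n - x_prime) <= width:
            worst = max(worst, abs(pred - f(x_prime)))
    return worst

for width in (0.2, 0.1, 0.05):
    # Compare the observed worst-case error with the bound 2 * L * (leaf diameter).
    print(width, worst_error(width), 2 * L_CONST * width)
```

Shrinking the cell width tightens the bound, which mirrors how the leaf diameter controls the relaxation level in the proof.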

Proof of Proposition 4.5.

Suppose $x_n$ is a sequence in $\mathcal{X}$ that converges to $x'$, and $A_n = E_f(S_n)$ for an arbitrary finite-cardinality set sequence $S_n \subset \mathcal{X}$. Hence $x_n$ is Cauchy. So, for any $n$ sufficiently large, we have $\|x_n - x_{n-1}\| \leq Mh$. For any such $n$, we have

$$
\begin{aligned}
\hat{f}_{\text{KR}}(x_n; A_{n-1} \cup \mathcal{D}_{n-1})
&= \frac{\sum_{i \leq n-1} K\left(\frac{\|x_n - x_i\|}{h}\right) f(x_i)}{\sum_{i \leq n-1} K\left(\frac{\|x_n - x_i\|}{h}\right)} \\
&= \frac{\sum_{i \leq n-1} K\left(\frac{\|x_n - x_i\|}{h}\right) (f(x_n) + e_n)}{\sum_{i \leq n-1} K\left(\frac{\|x_n - x_i\|}{h}\right)}
\end{aligned}
$$

where $|e_n| \leq L \|x_i - x_n\|$ by the $L$-Lipschitzness of $f$. Thus, the above is equal to $f(x_n) + \tilde{e}_n$ where $|\tilde{e}_n| \leq LMh$ since $K$ has support $[0, M]$.

So, we have

$$
\begin{aligned}
& \limsup_n |\hat{f}_{\text{Ker}}(x_n; A_{n-1} \cup \mathcal{D}_{n-1}) - f(x')| \\
&\leq \limsup_n |\hat{f}_{\text{Ker}}(x_n; A_{n-1} \cup \mathcal{D}_{n-1}) - f(x_n)| + \lim_n |f(x_n) - f(x')| \\
&\leq LMh
\end{aligned}
$$

which concludes the proposition. ∎
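To make the $LMh$ bound concrete, the sketch below implements a kernel regression predictor with a compactly supported kernel and compares $|\hat{f}(x_n) - f(x_n)|$ against $LMh$ for a few bandwidths. The objective `f`, the triangular kernel with $M = 1$, and the sequence $x_n = x' + 0.5/n$ with $S_n = \emptyset$ are hypothetical choices for this example, not the paper's experimental setup.

```python
import math

L_CONST = 3.5    # Lipschitz constant of the toy f below
M_SUPPORT = 1.0  # the kernel below vanishes outside [0, M_SUPPORT]

def f(x):
    # Hypothetical objective, used only for illustration.
    return math.sin(3.0 * x) + 0.5 * x

def kernel(u):
    # Triangular kernel: positive on [0, M_SUPPORT), zero elsewhere.
    return max(0.0, 1.0 - u / M_SUPPORT)

def kr_predict(x, points, values, h):
    """Kernel-weighted average of observed values; None if no point has positive weight."""
    weights = [kernel(abs(x - p) / h) for p in points]
    total = sum(weights)
    if total == 0.0:
        return None
    return sum(w * v for w, v in zip(weights, values)) / total

def max_gap(h, x_prime=0.4, n_max=300):
    """Largest |f_hat(x_n) - f(x_n)| over all n for which the predictor is defined."""
    xs = [x_prime + 0.5 / n for n in range(1, n_max + 1)]
    gaps = []
    for n in range(2, n_max + 1):
        past = xs[: n - 1]
        pred = kr_predict(xs[n - 1], past, [f(p) for p in past], h)
        if pred is not None:
            gaps.append(abs(pred - f(xs[n - 1])))
    return max(gaps)

for h in (0.2, 0.05, 0.01):
    print(h, max_gap(h), L_CONST * M_SUPPORT * h)  # observed gap vs. the bound L*M*h
```

Only points within distance $Mh$ of $x_n$ receive positive weight, so every term in the weighted average differs from $f(x_n)$ by at most $LMh$, which is exactly the mechanism used in the proof.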

Proof of Proposition 4.6.

Suppose $\{x_n\}$ converges to $x'$, and $A_n = E_f(S_n)$ for an arbitrary finite-cardinality set sequence $S_n \subset \mathcal{X}$. For each $r \sim \mathcal{R}$, by $\epsilon$-relaxed local consistency and the continuity of $r$, we have

$$
\begin{aligned}
\hat{f}^{(r)}(x_n; A_{n-1} \cup \mathcal{D}_{n-1})
&= r(x_n) + \hat{f}(x_n; A^{(r)}_{n-1} \cup \mathcal{D}^{(r)}_{n-1}) \\
&= r(x') + f(x') - r(x') + e_n = f(x') + e_n
\end{aligned}
\tag{18}
$$

where $\limsup_n |e_n| \leq \epsilon$. Thus, since $f, \hat{f}^{(r)}$ are bounded, we have

$$\limsup_n |E_{r \sim \mathcal{R}}[\hat{f}^{(r)}(x_n; A_{n-1} \cup \mathcal{D}_{n-1})] - f(x')| \leq E_{r \sim \mathcal{R}}[\limsup_n |\hat{f}^{(r)}(x_n; A_{n-1} \cup \mathcal{D}_{n-1}) - f(x')|] \leq \epsilon.$$
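The decomposition in (18) can be illustrated with a small Monte Carlo sketch. It assumes, as the second line of (18) suggests, that $\mathcal{D}^{(r)}$ carries the shifted labels $f(x_i) - r(x_i)$; this reading, the 1-nearest-neighbor base predictor, the random-sinusoid prior in `sample_prior`, and the toy objective `f` are all assumptions introduced for the example.

```python
import math
import random

def f(x):
    # Hypothetical objective, used only for illustration.
    return math.sin(3.0 * x) + 0.5 * x

def nn_predict(x, points, values):
    """Base surrogate: 1-nearest-neighbor prediction."""
    j = min(range(len(points)), key=lambda i: abs(points[i] - x))
    return values[j]

def sample_prior(rng):
    """Draw a bounded, continuous random function r (a random sinusoid here)."""
    a = rng.uniform(-1.0, 1.0)
    b = rng.uniform(0.5, 3.0)
    c = rng.uniform(0.0, 2.0 * math.pi)
    return lambda x: a * math.sin(b * x + c)

def randomized_prior_mean(x, points, priors):
    """Monte Carlo estimate of E_r[r(x) + f_hat(x; D^(r))] with labels f(x_i) - r(x_i)."""
    preds = []
    for r in priors:
        shifted = [f(p) - r(p) for p in points]  # assumed form of D^(r)
        preds.append(r(x) + nn_predict(x, points, shifted))
    return sum(preds) / len(preds)

rng = random.Random(0)  # fixed seed so the run is reproducible
priors = [sample_prior(rng) for _ in range(20)]
x_prime = 0.4
xs = [x_prime + 0.5 / n for n in range(1, 201)]  # x_n -> x'
errors = [abs(randomized_prior_mean(xs[n - 1], xs[: n - 1], priors) - f(x_prime))
          for n in range(2, 201)]
print(errors[0], errors[-1])
```

Each ensemble member adds back the prior draw it subtracted from the labels, so the contribution of $r$ cancels in the limit and the averaged prediction approaches $f(x')$, as in the final display of the proof.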

Proof of Proposition 4.7.

Suppose $\{x_{n}\}$ converges to $x'$, and $A_{n}=E_{f}(S_{n})$ for an arbitrary finite-cardinality set sequence $S_{n}\subset\mathcal{X}$. Given a finite class of $\epsilon$-relaxed locally consistent SPs $\hat{\mathcal{F}}=\{\hat{f}_{i}\}$, we have

$$\limsup_{n}\left|\sum_{i=1}^{|\hat{\mathcal{F}}|}\alpha_{i}\hat{f}_{i}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})-f(x')\right|\leq\sum_{i=1}^{|\hat{\mathcal{F}}|}\alpha_{i}\limsup_{n}|\hat{f}_{i}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})-f(x')|\leq\epsilon,$$

with the assumption that $\sum_{i=1}^{|\hat{\mathcal{F}}|}\alpha_{i}=1$. ∎
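The first inequality in the proof of Proposition 4.7 combines the normalization of the weights with the triangle inequality. A worked version of that step is sketched below, assuming additionally that the weights are nonnegative (as in a convex combination) and writing $\hat{f}_{i,n}$ as a shorthand, introduced here only for brevity, for $\hat{f}_{i}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})$:

```latex
\begin{align*}
\Bigl|\sum_{i}\alpha_{i}\hat{f}_{i,n}-f(x')\Bigr|
  &=\Bigl|\sum_{i}\alpha_{i}\bigl(\hat{f}_{i,n}-f(x')\bigr)\Bigr|
  && \text{since } \textstyle\sum_{i}\alpha_{i}=1\\
  &\leq\sum_{i}\alpha_{i}\,\bigl|\hat{f}_{i,n}-f(x')\bigr|
  && \text{since } \alpha_{i}\geq 0 .
\end{align*}
```

Taking $\limsup_{n}$ on both sides and using its subadditivity over the finite sum gives the bound $\sum_{i}\alpha_{i}\epsilon=\epsilon$.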

Recall the NEB property of GP in [68]:

Assumption 8.1 (NEB property).

A GP has the NEB property if, for all sequences $\{x_{n}\}$ in $\mathcal{X}$ and all $x'$ in $\mathcal{X}$, the following statements are equivalent:

  1. $x'$ is an adherent point of $\{x_{n}\}$, i.e., there is a subsequence of $\{x_{n}\}$ that converges to $x'$.

  2. $\hat{\sigma}_{\text{GP}}^{2}(x';\mathcal{D}_{n})\to 0$ as $n\to\infty$.

Here $\hat{\sigma}_{\text{GP}}^{2}(x';\mathcal{D}_{n})$ denotes the posterior variance of the GP with data $\mathcal{D}_{n}$ collected. We are ready to prove Proposition 4.8:

Proof of Proposition 4.8.

Proposition 10 in [68] verifies the NEB property of GP, under the assumptions that $\mathcal{X}$ is compact, the GP is centered, the kernel function $\mathcal{K}$ of the GP is continuous and stationary, and its spectral density $\mathcal{S}$ is such that $\mathcal{S}^{-1}$ has at most polynomial growth. We show our SNEB property below based on this result. Moreover, it is clearly equivalent to show the required properties for the posterior variance $\hat{\sigma}_{\text{GP}}^{2}(x';\mathcal{D}_{n})$ instead of the posterior standard deviation $\hat{\sigma}_{\text{GP}}(x';\mathcal{D}_{n})$.

To show part 1 of Assumption 3.4, suppose $\inf_{n}\Delta(x;S_{n})>0$ for an arbitrary finite-cardinality set sequence $S_{n}\subset\mathcal{X}$. This implies $\inf_{n}\Delta(x;\cup_{k\leq n}S_{k})>0$. Therefore, $x$ cannot be an adherent point of $\cup_{k\leq n}S_{k}$. The contrapositive of the implication from statement 2 to statement 1 in Assumption 8.1 tells us $\hat{\sigma}_{\text{GP}}^{2}(x;\cup_{k\leq n}S_{k})\not\to 0$, and since $\hat{\sigma}_{\text{GP}}^{2}(x;\cup_{k\leq n}S_{k})\geq 0$ and is monotonically non-increasing in $n$ by the non-decreasing property of $\cup_{k\leq n}S_{k}$, we must have $\liminf_{n\to\infty}\hat{\sigma}_{\text{GP}}^{2}(x;S_{n})\geq\liminf_{n\to\infty}\hat{\sigma}_{\text{GP}}^{2}(x;\cup_{k\leq n}S_{k})>0$.

Denote the GP as $\xi(\cdot)$. To show part 2 of Assumption 3.4, suppose $x_{n}$ converges to $x'$, and $A_{n}=E_{f}(S_{n})$ for an arbitrary finite-cardinality set sequence $S_{n}\subset\mathcal{X}$. We have

$$\begin{aligned}
\hat{\sigma}_{\text{GP}}^{2}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1}) &\stackrel{\text{(a)}}{\leq}\hat{\sigma}_{\text{GP}}^{2}(x_{n};\mathcal{D}_{n-1})\\
&\stackrel{\text{(b)}}{=}\operatorname{Var}[\xi(x_{n})-\hat{f}_{\text{GP}}(x_{n};\mathcal{D}_{n-1})]\\
&\stackrel{\text{(c)}}{\leq}\operatorname{Var}[\xi(x_{n})-\xi(x_{n-1})]\\
&\stackrel{\text{(d)}}{=}\mathcal{K}(x_{n},x_{n})-2\mathcal{K}(x_{n},x_{n-1})+\mathcal{K}(x_{n-1},x_{n-1})\stackrel{\text{(e)}}{\to}0,
\end{aligned}$$

where $\hat{f}_{\text{GP}}$ is the posterior mean of the GP. Inequality (a) follows from the non-increasing property of the posterior variance as more points are sampled; equality (b) follows by the definition of variance; inequality (c) follows by the fact that the posterior mean is the $\mathcal{L}^{2}$ best linear predictor, so its error variance is no larger than that of the particular predictor $\xi(x_{n-1})$; equality (d) follows by expanding out the variance; and the convergence (e) follows by the continuity of $\mathcal{K}$ and the convergence, and hence Cauchyness, of $x_{n}$.
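The chain (a) to (e) can be checked numerically. The sketch below is only an illustration under stated assumptions: the squared-exponential kernel, its length scale, the small jitter added for numerical stability, and the sequence $x_{n}=x'+1/n$ are all choices made here, not taken from the paper. It compares the GP posterior variance at $x_{n}$ given the earlier points with the increment bound $\mathcal{K}(x_{n},x_{n})-2\mathcal{K}(x_{n},x_{n-1})+\mathcal{K}(x_{n-1},x_{n-1})$.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel (continuous and stationary); an assumed choice."""
    a = np.atleast_2d(a)
    b = np.atleast_2d(b)
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)

def posterior_variance(x, X, jitter=1e-8):
    """GP posterior variance at x given design points X (jitter is for stability only)."""
    x = np.atleast_2d(x)
    K_XX = rbf_kernel(X, X) + jitter * np.eye(len(X))
    k_xX = rbf_kernel(x, X)
    reduction = k_xX @ np.linalg.solve(K_XX, k_xX.T)
    return float(rbf_kernel(x, x)[0, 0] - reduction[0, 0])

def increment_variance(x_new, x_prev):
    """Var[xi(x_new) - xi(x_prev)], expanded as in step (d)."""
    x_new, x_prev = np.atleast_2d(x_new), np.atleast_2d(x_prev)
    return float(
        rbf_kernel(x_new, x_new)[0, 0]
        - 2.0 * rbf_kernel(x_new, x_prev)[0, 0]
        + rbf_kernel(x_prev, x_prev)[0, 0]
    )

# Hypothetical convergent sequence x_n = x' + 1/n in one dimension.
x_limit = 0.3
xs = np.array([[x_limit + 1.0 / n] for n in range(1, 9)])

post_vars, bounds = [], []
for n in range(1, len(xs)):
    post_vars.append(posterior_variance(xs[n], xs[:n]))  # variance at x_n given earlier points
    bounds.append(increment_variance(xs[n], xs[n - 1]))  # right-hand side of step (d)

for v, b in zip(post_vars, bounds):
    print(f"posterior variance {v:.3e}   increment bound {b:.3e}")
```

Only the locations enter the computation, since the GP posterior variance does not depend on the observed function values. The extra set $A_{n-1}$ is omitted because, by step (a), including it can only reduce the variance further.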

Proof of Proposition 4.9.

To verify part 1 of Assumption 3.4, suppose $\inf_{n}\Delta(x;S_{n}\cup X_{n})>0$. Then $\liminf_{n\to\infty}\Delta(x;S_{n}\cup X_{n})\geq\inf_{n}\Delta(x;S_{n}\cup X_{n})>0$.

To verify part 2 of Assumption 3.4, suppose $x_{n}\to x'$ for some $x'$. We have $\Delta(x_{n};S_{n-1}\cup X_{n-1})\leq\|x_{n}-x_{n-1}\|\to 0$ since $x_{n}$ is Cauchy. ∎
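Both parts of this argument are easy to check numerically. The following is a minimal sketch, assuming $\Delta(x;S)$ is the Euclidean distance from $x$ to the nearest point of the finite set $S$. The function name `min_distance` and the sequence used are illustrative choices, and the extra sets $S_{n}$ are left out because adding points can only shrink the minimum distance.

```python
import numpy as np

def min_distance(x, points):
    """Delta(x; S): Euclidean distance from x to the nearest point of a finite set S."""
    points = np.atleast_2d(points)
    return float(np.min(np.linalg.norm(points - np.asarray(x), axis=1)))

# Hypothetical convergent sequence x_n = x' + d / n in two dimensions.
x_limit = np.array([0.2, -0.1])
direction = np.array([1.0, 1.0])
xs = [x_limit + direction / n for n in range(1, 20)]

gaps = []
for n in range(1, len(xs)):
    delta = min_distance(xs[n], xs[:n])                  # distance to previously visited points
    step = float(np.linalg.norm(xs[n] - xs[n - 1]))      # ||x_n - x_{n-1}||
    gaps.append((delta, step))

# Part 1: a point that stays away from every visited location keeps a positive distance.
far_point = np.array([5.0, 5.0])
far_gap = min_distance(far_point, xs)
print(gaps[-1], far_gap)
```

Because $x_{n-1}$ belongs to the set of previously visited points, the minimum distance can never exceed the last step length, which is the inequality used in part 2.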

Proof of Proposition 4.10.

Part 1 of Assumption 3.11 follows directly from the assumption of the proposition. To verify part 2, suppose $\{x_{n}\}$ converges to $x'$, and $A_{n}=E_{f}(S_{n})$ for an arbitrary finite-cardinality set sequence $S_{n}\subset\mathcal{X}$. From the $\epsilon$-relaxed local consistency of $\hat{f}^{(r)}$ and the continuity of $r$, we have $\limsup_{n}|\hat{f}^{(r)}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})-f(x')|\leq\epsilon$ for any $r$ almost surely, by following the first part of the proof of Proposition 4.6. Then, since $\hat{f}^{(r)}$ is uniformly bounded, we have

$$\begin{aligned}
\limsup_{n}\left(\operatorname{Var}_{r\sim\mathcal{R}}[\hat{f}^{(r)}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})]\right)^{1/2} &\leq\limsup_{n}\left(E_{r\sim\mathcal{R}}[(\hat{f}^{(r)}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})-f(x'))^{2}]\right)^{1/2}\\
&\leq\left(E_{r\sim\mathcal{R}}[\limsup_{n}(\hat{f}^{(r)}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})-f(x'))^{2}]\right)^{1/2}\\
&\leq\epsilon
\end{aligned}$$

and so

$$\left(\left(\operatorname{Var}_{r\sim\mathcal{R}}[\hat{f}^{(r)}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})]\right)^{1/2}-\epsilon\right)_{+}\to 0.$$
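The first inequality in the proof of Proposition 4.10 bounds a variance by a second moment. One way to see this step, offered as a sketch with the shorthand $Y_{n}:=\hat{f}^{(r)}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})$ (introduced here only for brevity), uses the fact that the variance over $r$ is unchanged when the constant $f(x')$ is subtracted:

```latex
\begin{align*}
\operatorname{Var}_{r\sim\mathcal{R}}[Y_{n}]
  &=\operatorname{Var}_{r\sim\mathcal{R}}[Y_{n}-f(x')]\\
  &=E_{r\sim\mathcal{R}}\bigl[(Y_{n}-f(x'))^{2}\bigr]-\bigl(E_{r\sim\mathcal{R}}[Y_{n}-f(x')]\bigr)^{2}\\
  &\leq E_{r\sim\mathcal{R}}\bigl[(Y_{n}-f(x'))^{2}\bigr].
\end{align*}
```

Since $\limsup_{n}|Y_{n}-f(x')|\leq\epsilon$ almost surely and $Y_{n}$ is uniformly bounded, the $\limsup$ can be moved inside the expectation (reverse Fatou lemma), which gives the bound $\epsilon^{2}$ on the second moment in the limit.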

Proof of Corollary 4.11.

To apply Proposition 4.10, we verify the assumptions in Proposition 4.6 for the kernel regression SP, with $\epsilon=(L+\tilde{L})Mh$. To this end, note that with the assumptions in Proposition 4.5 and the additional assumptions in Corollary 4.11, $f-r$ is an $(L+\tilde{L})$-Lipschitz function. Thus, together with the continuity of $r$, $\hat{f}^{(r)}$ is $(L+\tilde{L})Mh$-relaxed locally consistent almost surely in $r$. Moreover, the uniform boundedness of $r$ and $f$ also implies the same property for $\hat{f}^{(r)}$ driven by the kernel regression SP. These verify all the assumptions in Proposition 4.6.

Next, note that when $\inf_{n}\Delta(x_{0};\mathcal{D}_{n})>Mh$, we have $\hat{f}_{\text{Ker}}(x_{0};\mathcal{D}_{n}^{(r)})$ equal to a prefixed constant by the construction of the kernel regression SP. In this case, $\hat{f}^{(r)}(x_{0};\mathcal{D}_{n})=r(x_{0})$. So, by assumption we have $(\operatorname{Var}_{r\sim\mathcal{R}}[\hat{f}^{(r)}(x_{0};\mathcal{D}_{n})])^{1/2}=(\operatorname{Var}_{r\sim\mathcal{R}}[r(x_{0})])^{1/2}>(L+\tilde{L})Mh$. This verifies the last assumption in Proposition 4.10 with $\delta=Mh$. Hence we conclude the corollary. ∎
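The two facts used in this proof can be illustrated with a small numerical sketch. Everything specific in it is an assumption made for illustration rather than the construction of the paper: a one-dimensional domain, a Nadaraya-Watson estimator with a triangular kernel supported on $|x-x_{i}|\leq Mh$, a default value of 0 when no data point lies in the support, the objective $f(x)=\sin(3x)$ with $L=3$, and random prior functions $r(x)=a\sin(x+b)$ with $|a|\leq 1$, so that $\tilde{L}=1$.

```python
import numpy as np

def kernel_regression(x, X, y, h, M=1.0, default=0.0):
    """Nadaraya-Watson estimate with a triangular kernel supported on |x - x_i| <= M*h.
    Returns `default` (an assumed prefixed constant) when no data lies in the support."""
    w = np.maximum(0.0, 1.0 - np.abs(X - x) / (M * h))
    total = w.sum()
    return default if total == 0.0 else float(w @ y / total)

def randomized_prior_prediction(x, X, f_values, r, h, M=1.0):
    """f_hat^(r)(x) = r(x) + kernel regression fitted to the residuals f - r."""
    return float(r(x)) + kernel_regression(x, X, f_values - r(X), h, M)

rng = np.random.default_rng(0)
f = lambda x: np.sin(3.0 * x)        # hypothetical objective, Lipschitz constant L = 3
X = np.linspace(0.0, 1.0, 11)        # hypothetical evaluated locations
h, M, L, L_tilde = 0.1, 1.0, 3.0, 1.0

def sample_prior():
    a, b = rng.uniform(-1.0, 1.0), rng.uniform(0.0, 2.0 * np.pi)
    return lambda x: a * np.sin(x + b)   # Lipschitz constant |a| <= 1 = L_tilde

priors = [sample_prior() for _ in range(200)]
near, far = 0.53, 2.0                # `far` is farther than M*h from every data point

near_preds = np.array([randomized_prior_prediction(near, X, f(X), r, h, M) for r in priors])
far_preds = np.array([randomized_prior_prediction(far, X, f(X), r, h, M) for r in priors])
prior_at_far = np.array([float(r(far)) for r in priors])

print("max error near data:", np.max(np.abs(near_preds - f(near))))
print("ensemble std far from data:", far_preds.std())
```

Near the data, every member of the ensemble stays within $(L+\tilde{L})Mh$ of $f$. Far from the data, each prediction reduces to $r(x_{0})$, so the ensemble spread there equals the spread of the prior functions themselves.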

Proof of Proposition 4.12.

Given a class of UQs $\hat{\Sigma}:=\{\hat{\sigma}_{1},\hat{\sigma}_{2},\ldots\}$ with the $\delta$-relaxed SNEB property, to show part 1 of Assumption 3.11, suppose $\inf_{n}\Delta(x;S_{n})>\delta$, with $A_{n}=E_{f}(S_{n})$ for an arbitrary finite-cardinality set sequence $S_{n}\subset\mathcal{X}$. For any convex combination of $\hat{\Sigma}$,

$$\liminf_{n\to\infty}\sum_{i=1}^{|\hat{\Sigma}|}\alpha_{i}\hat{\sigma}_{i}(x;A_{n-1}\cup\mathcal{D}_{n-1})\geq\sum_{i=1}^{|\hat{\Sigma}|}\alpha_{i}\liminf_{n\to\infty}\hat{\sigma}_{i}(x;A_{n-1}\cup\mathcal{D}_{n-1})>0,$$

where the last inequality is inherited from part 1 of Assumption 3.11 for the individual UQs and from the fact that at least one of the $\alpha_{i}$'s is positive since they sum up to 1.

To show part 2 of Assumption 3.11, suppose $\{x_{n}\}$ converges to $x'$, and $A_{n}=E_{f}(S_{n})$ for an arbitrary finite-cardinality set sequence $S_{n}\subset\mathcal{X}$. For any convex combination of $\hat{\Sigma}$,

$$\lim_{n\to\infty}\sum_{i=1}^{|\hat{\Sigma}|}\alpha_{i}\hat{\sigma}_{i}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})=\sum_{i=1}^{|\hat{\Sigma}|}\alpha_{i}\lim_{n\to\infty}\hat{\sigma}_{i}(x_{n};A_{n-1}\cup\mathcal{D}_{n-1})=0.$$

Proof of Proposition 4.13.

To verify part 1 of Assumption 3.5, suppose $\liminf_{n\to\infty}p_{n}>-\infty$ and $\liminf_{n\to\infty}q_{n}>0$. Then, for $n\geq N$ for some large $N$, we have $p_{n}\geq c_{1}>-\infty$ and $q_{n}\geq c_{2}>0$ for some $c_{1}$ and $c_{2}$, and so $g_{n}^{\text{PI}}(p_{n},q_{n})\geq\inf_{n\geq N}\Phi((c_{1}-\tau)/q_{n})$, which is at least $\Phi((c_{1}-\tau)/c_{2})$ if $c_{1}-\tau<0$, and $1/2$ if $c_{1}-\tau\geq 0$. Thus, $\liminf_{n\to\infty}g_{n}^{\text{PI}}(p_{n},q_{n})>0$.

To verify part 2 of Assumption 3.5, suppose $\limsup_{n\to\infty}p_{n}\leq 0$ and $q_{n}\to 0$. Then $\limsup_{n\to\infty}p_{n}-\tau<0$ since $\tau>0$, and thus eventually $p_{n}-\tau\leq c<0$ for some $c$, giving $g_{n}^{\text{PI}}(p_{n},q_{n})\to 0$ by directly using (12). ∎
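Both parts of this proof can be checked numerically. The sketch below assumes the probability-of-improvement form $g^{\text{PI}}(p,q)=\Phi((p-\tau)/q)$, which is read off from the bound used in part 1; equation (12) itself is not reproduced in this section. The values of $\tau$, $c_{1}$, $c_{2}$ and the test sequences are hypothetical choices.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probability_of_improvement(p, q, tau):
    """Assumed form g^PI(p, q) = Phi((p - tau) / q), valid for q > 0."""
    return normal_cdf((p - tau) / q)

tau = 0.1

# Part 1: p_n >= c1 and q_n >= c2 > 0 keep the acquisition bounded away from zero.
c1, c2 = -2.0, 0.5
lower_bound = normal_cdf((c1 - tau) / c2)   # the case c1 - tau < 0 of the proof
part1_values = [
    probability_of_improvement(c1 + 1.0 / n, c2 + 1.0 / n, tau) for n in range(1, 50)
]

# Part 2: p_n <= 0 and q_n -> 0 drive the acquisition to zero.
part2_values = [probability_of_improvement(0.0, 1.0 / n, tau) for n in range(1, 200)]

print(min(part1_values), lower_bound, part2_values[-1])
```

The role of $\tau>0$ is visible in part 2: with $\tau=0$ and $p_{n}=0$ the value would stay at $1/2$ however small $q_{n}$ becomes.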

Proof of Proposition 4.14.

To verify part 1 of Assumption 3.5, suppose $\liminf_{n\to\infty}p_n>-\infty$ and $\liminf_{n\to\infty}q_n>0$. Then, for $n\geq N$ for a large enough $N$, we have $p_n\geq c_1>-\infty$ and $q_n\geq c_2>0$ for some $c_1,c_2$. Note that $g_n^{\text{EI}}(p_n,q_n)=E(N(p_n,q_n)-\tau)_+$, where $N(p_n,q_n)$ denotes a normal variable with mean $p_n$ and standard deviation $q_n$. Suppose $n\geq N$. We have that $N(p_n,q_n)$ stochastically dominates $N(c_1,q_n)$, and thus $E(N(p_n,q_n)-\tau)_+\geq E(N(c_1,q_n)-\tau)_+$ since $(\cdot-\tau)_+$ is non-decreasing. Moreover, $N(c_1,q_n)$ second-order stochastically dominates $N(c_1,c_2)$, and thus $E(N(c_1,q_n)-\tau)_+\geq E(N(c_1,c_2)-\tau)_+$ since $(\cdot-\tau)_+$ is non-decreasing and convex. Hence $g_n^{\text{EI}}(p_n,q_n)\geq g_n^{\text{EI}}(c_1,c_2)>0$. This gives $\liminf_{n\to\infty}g_n^{\text{EI}}(p_n,q_n)>0$.

To verify part 2 of Assumption 3.5, suppose $\limsup_{n\to\infty}p_n\leq 0$ and $q_n\to 0$. Then $g_n^{\text{EI}}(p_n,q_n)\to 0$ by directly using (12) and noting that the function is continuous even at $q_n=0$. ∎

Proof of Proposition 4.15.

Since $\tau$ is fixed, $\beta_n\to\infty$ and $p_n$ is bounded, we have $\frac{p_n-\tau}{\beta_n}\to 0$. To verify part 1 of Assumption 3.5, supposing $\liminf_{n\to\infty}q_n>0$, we have $\liminf_{n\to\infty}g_n^{\text{UCB}}(p_n,q_n)=\liminf_{n\to\infty}\{\frac{p_n-\tau}{\beta_n}+q_n\}=\liminf_{n\to\infty}q_n>0$.
To verify part 2, supposing $q_n\to 0$, we have $g_n^{\text{UCB}}(p_n,q_n)=\frac{p_n-\tau}{\beta_n}+q_n\to 0$. ∎
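
As a quick numerical illustration of the limits used in the three proofs above, the following minimal sketch codes the evaluators in the form in which they appear there: PI as $\Phi((p-\tau)/q)$, EI as $E(N(p,q)-\tau)_+$ through its standard closed form $(p-\tau)\Phi(z)+q\phi(z)$ with $z=(p-\tau)/q$, and UCB as $(p-\tau)/\beta_n+q$. The function names and the requirement $q>0$ are our own choices for this sketch and are not taken from the paper's code.

```python
import math


def norm_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def g_pi(p, q, tau):
    """Probability of improvement Phi((p - tau) / q); this sketch needs q > 0."""
    return norm_cdf((p - tau) / q)


def g_ei(p, q, tau):
    """Expected improvement E(N(p, q) - tau)_+ in closed form; needs q > 0."""
    z = (p - tau) / q
    return (p - tau) * norm_cdf(z) + q * norm_pdf(z)


def g_ucb(p, q, tau, beta):
    """UCB-type evaluator (p - tau) / beta + q."""
    return (p - tau) / beta + q


if __name__ == "__main__":
    # Part 2 behaviour: p <= 0 < tau and q shrinking drives PI and EI to 0.
    for q in (1.0, 0.1, 0.01):
        print(q, g_pi(0.0, q, 0.5), g_ei(0.0, q, 0.5))
    # UCB: a growing beta removes the influence of p, leaving q.
    for beta in (1.0, 10.0, 1000.0):
        print(beta, g_ucb(1.0, 0.2, 0.5, beta))
```

Increasing $q$ at a fixed mean raises the EI value, which matches the convexity argument in the proof of Proposition 4.14.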

Proof of Proposition 4.16.

Both parts of Assumption 3.5 can be straightforwardly shown to be preserved under convex combinations. ∎

9 Additional Experiment Details

9.1 Implementation Details

The methods we compare with are listed in Table 3.

Methods Software library URL
RS Hyperopt https://github.com/hyperopt/hyperopt
BO BoTorch https://botorch.org
TuRBO - https://github.com/uber-research/TuRBO
TPE Hyperopt https://github.com/hyperopt/hyperopt
BORE Syne Tune https://github.com/awslabs/syne-tune
LFBO - https://github.com/lfbo-ml/lfbo
Table 3: Package information.

In particular, for standard BO, we use the Matérn kernel with the default hyperparameter values in BoTorch. For BORE and LFBO, we use XGBoost as the classifier with the preset parameters in their original implementations, since LFBO and BORE with XGBoost have relatively good and stable performance across tasks and are computationally much faster than with other classifiers (e.g., random forest or neural network).

All experiments are conducted on a computer with a 4-core Intel(R) Xeon(R) CPU @ 2.30GHz and a Tesla T4 GPU.

9.1.1 Hyperparameters for PseudoBO Methods

Randomized prior functions.   The random functions in all tasks are sampled from a random 3-layer neural network $r(x)=W_3\,\text{Tanh}(W_2\,\text{Tanh}(W_1x+b_1)+b_2)+b_3$, with the Glorot random initialization [24].
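
To make the construction of $r(x)$ concrete, the following is a minimal sketch of sampling one such random function in plain Python. The hidden width of 16, the uniform variant of Glorot initialization, the zero biases and the scalar output are assumptions of this sketch; the text above does not specify them.

```python
import math
import random


def glorot_matrix(rng, fan_out, fan_in):
    """Glorot-uniform weights: U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    a = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-a, a) for _ in range(fan_in)] for _ in range(fan_out)]


def affine(W, b, v):
    """Compute W v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def sample_prior_function(dim, width=16, seed=0):
    """Return r(x) = W3 tanh(W2 tanh(W1 x + b1) + b2) + b3 with random weights."""
    rng = random.Random(seed)
    sizes = [dim, width, width, 1]  # assumed layer sizes
    Ws = [glorot_matrix(rng, sizes[i + 1], sizes[i]) for i in range(3)]
    bs = [[0.0] * sizes[i + 1] for i in range(3)]  # assumption: zero biases

    def r(x):
        h1 = [math.tanh(v) for v in affine(Ws[0], bs[0], x)]
        h2 = [math.tanh(v) for v in affine(Ws[1], bs[1], h1)]
        return affine(Ws[2], bs[2], h2)[0]

    return r


if __name__ == "__main__":
    r = sample_prior_function(dim=3, seed=1)
    print(r([0.1, 0.2, 0.3]))
```

Each seed gives a different fixed function, so repeated calls with different seeds play the role of independent draws of the random prior.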

Kernel.   We use the Gaussian kernel in all models and all tasks.

Bandwidths.   For local kernel regression, we configure the bandwidth as $h_n(x)=(1-e^{-\Delta(x,X_n)\cdot n})\cdot(h_n^{(u)}-h_n^{(l)})+h_n^{(l)}$, where $h_n^{(l)}=h_0^{(l)}\cdot n^{-\frac{1}{2+d}}$ and $h_n^{(u)}=h_0^{(u)}\cdot n^{-\frac{1}{2+d}}$ for constants $h_0^{(u)}\geq h_0^{(l)}>0$. For the randomized prior method, we adopt $h_n'=h_0'\cdot n^{-\frac{1}{2+d}}$. From our empirical investigation, choosing $h_0^{(l)}$, $h_0^{(u)}$ and $h_0'$ to be small helps the performance of the PseudoBO models. Moreover, note that we have selected bandwidths that depend on $n$. This dependence on $n$ is negligible, and our theory for the $(\epsilon,\delta)$-relaxed algorithmic consistency guarantee recommends using a fixed bandwidth. However, we choose this light dependence on $n$ to match suggestions from standard minimax results in nonparametric statistics [26].
Typically, supposing $\mathcal{X}=[0,1]^d$, choosing $h_0^{(l)}\in[0.075,0.1]$, $h_0^{(u)}\in[0.2,0.4]$, and $h_0'=0.005$ (in hybrid randomized prior UQ) gives relatively promising performance.
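
The location-dependent bandwidth above can be sketched in a few lines. Here we assume that $\Delta(x,X_n)$ is the Euclidean distance from $x$ to the nearest queried point, which this section does not restate, and the default constants are simply picked from the typical ranges quoted above.

```python
import math


def nearest_distance(x, X):
    """Assumed form of Delta(x, X_n): distance from x to the closest point of X."""
    return min(math.dist(x, xi) for xi in X)


def local_bandwidth(x, X, h0_l=0.075, h0_u=0.2):
    """h_n(x) = (1 - exp(-Delta * n)) * (h_n^(u) - h_n^(l)) + h_n^(l)."""
    n, d = len(X), len(x)
    rate = n ** (-1.0 / (2 + d))          # common n^{-1/(2+d)} factor
    h_l, h_u = h0_l * rate, h0_u * rate   # lower and upper bandwidths at step n
    delta = nearest_distance(x, X)
    return (1.0 - math.exp(-delta * n)) * (h_u - h_l) + h_l


if __name__ == "__main__":
    X = [[0.0, 0.0], [1.0, 1.0]]
    print(local_bandwidth([0.0, 0.0], X))  # at a queried point: lower bandwidth
    print(local_bandwidth([0.5, 0.5], X))  # away from the data: closer to upper
```

Under this reading, the bandwidth equals $h_n^{(l)}$ at already-queried points and moves towards $h_n^{(u)}$ in unexplored regions.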

Specifically, PseudoBO-RP uses $h_{0,i}'=0.075(\mathcal{X}_{i,1}-\mathcal{X}_{i,0})$ for synthetic function optimization; $h_0'=0.1\times[1/2,1/2,1/2,1/2,1/4,1/3,1/3,1/6,1/2,1/2,1/6,1/6]$ for neural network tuning, where the denominators in the vector represent the number of candidates for each hyperparameter (see Section 9.3 for further details); and $h_{0,i}'=0.1(\mathcal{X}_{i,1}-\mathcal{X}_{i,0})$ for both RL tasks.

PseudoBO-KR-Hyb/PseudoBO-KR-Hyb-TR use $h_0'=0.01\times[1/2,1/2,1/2,1/2,1/4,1/3,1/3,1/6,1/2,1/2,1/6,1/6]$ in neural network tuning, and $h_{0,i}'=0.005(\mathcal{X}_{i,1}-\mathcal{X}_{i,0})$ in all other tasks for UQ. For SP, they use $h_{0,i}^{(l)}=0.05(\mathcal{X}_{i,1}-\mathcal{X}_{i,0})$, $h_{0,i}^{(u)}=0.2(\mathcal{X}_{i,1}-\mathcal{X}_{i,0})$ in synthetic function optimization; $h_0^{(l)}=0.2\times[1/2,1/2,1/2,1/2,1/4,1/3,1/3,1/6,1/2,1/2,1/6,1/6]$, $h_0^{(u)}=0.8\times[1/2,1/2,1/2,1/2,1/4,1/3,1/3,1/6,1/2,1/2,1/6,1/6]$ in neural network tuning; $h_{0,i}^{(l)}=0.1(\mathcal{X}_{i,1}-\mathcal{X}_{i,0})$, $h_{0,i}^{(u)}=0.325(\mathcal{X}_{i,1}-\mathcal{X}_{i,0})$ in robot pushing; and $h_{0,i}^{(l)}=0.1(\mathcal{X}_{i,1}-\mathcal{X}_{i,0})$, $h_{0,i}^{(u)}=0.375(\mathcal{X}_{i,1}-\mathcal{X}_{i,0})$ in rover trajectory planning.

Perturbation probability of Sobol sequence.   As in TuRBO [20], we use the Sobol sequence for inner optimization of all PseudoBO methods, with a perturbation probability of $1$ in the 2D synthetic function tasks, $0.75$ in the 6D synthetic function task, $0.5$ in the 10D synthetic function task, $0.4$ in the 12D neural network tuning task, $0.35$ in the 14D robot push task, and $0.15$ in the 60D rover trajectory task.

Robust UQ.   In the task of neural network hyperparameter tuning, we observe that the collected query data exhibits large variability, resulting in bad UQ estimates. Therefore, we perform winsorization on the data to ensure that the UQ works robustly. In particular, we replace every objective value smaller than $q_3-5(q_3-q_1)$ with $q_3-5(q_3-q_1)$, where $q_1$ and $q_3$ represent the first and third quartiles of the data.
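
The winsorization step can be sketched as follows. The quartile convention (the default "exclusive" method of Python's `statistics.quantiles`) and the function name are assumptions of this sketch, since the text does not say how the quartiles are estimated.

```python
import statistics


def winsorize_lower(values, k=5.0):
    """Clip values below q3 - k * (q3 - q1) up to that threshold."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # first, second, third quartile
    floor = q3 - k * (q3 - q1)
    return [max(v, floor) for v in values]


if __name__ == "__main__":
    objectives = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, -100.0]
    print(winsorize_lower(objectives))
```

Only the lower tail is clipped, matching the one-sided rule stated above; values above the threshold are left unchanged.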

9.2 Calibrated Coverage Rate

We employ our proposed CCR criterion to assess how well the UQs of the considered methods, GP, NN+MD, RP and KR+Hyb, are calibrated. We generate the training set $\mathcal{D}_{\text{train}}$, the validation set $\mathcal{D}_{\text{val}}$ and the test set $\mathcal{D}_{\text{test}}$ by uniformly sampling $20$, $10$ and $150$ points, respectively, from the decision space at random, with their labels evaluated by the black-box function.

We have purposefully designed the sizes of $\mathcal{D}_{\text{train}}$, $\mathcal{D}_{\text{val}}$, and $\mathcal{D}_{\text{test}}$. The validation set is smaller than the training set so that the calibration over the validation set does not disclose excessive information about the unknown function's shape. Consequently, the quality of calibration depends on a combination of learning from the training set and the supplementary information from the validation set. Moreover, the test set is much larger than both, for the purpose of a more accurate evaluation of the true performance of each UQ method.

To find $\lambda_{\text{val}}=\min_{\lambda\geq 0}\lambda$ such that

$$\mathbb{P}_{(x,y)\sim\mathcal{D}_{\text{val}}}\big(y\in[\hat{f}(x;\mathcal{D}_{\text{train}})-\lambda\hat{\sigma}(x;\mathcal{D}_{\text{train}}),\ \hat{f}(x;\mathcal{D}_{\text{train}})+\lambda\hat{\sigma}(x;\mathcal{D}_{\text{train}})]\big)=1,$$

we use the bisection approach in Algorithm 1:

Input: Pretrained SP $\hat{f}(\cdot;\mathcal{D}_{\text{train}})$, pretrained UQ $\hat{\sigma}(\cdot;\mathcal{D}_{\text{train}})$, validation set $\mathcal{D}_{\text{val}}$, and tolerance level $\epsilon$;
Output: Multiplier $\lambda_{\text{val}}$;
Initialize $\lambda_l\leftarrow 0$, $\lambda_{\text{init}}\leftarrow 1$, $\lambda_u\leftarrow\infty$.
while $\lambda_u=\infty$ do
    Compute $CR_{\text{init}}\leftarrow\mathbb{P}_{(x,y)\sim\mathcal{D}_{\text{val}}}\big(y\in[\hat{f}(x;\mathcal{D}_{\text{train}})-\lambda_{\text{init}}\hat{\sigma}(x;\mathcal{D}_{\text{train}}),\ \hat{f}(x;\mathcal{D}_{\text{train}})+\lambda_{\text{init}}\hat{\sigma}(x;\mathcal{D}_{\text{train}})]\big)$.
    if $CR_{\text{init}}<1$ then
        $\lambda_{\text{init}}\leftarrow 2\cdot\lambda_{\text{init}}$.
    else
        $\lambda_u\leftarrow\lambda_{\text{init}}$.
    end if
end while
while $\lambda_u-\lambda_l>\epsilon$ do
    $\lambda_{\text{val}}\leftarrow(\lambda_l+\lambda_u)/2$.
    Compute $CR_{\text{val}}\leftarrow\mathbb{P}_{(x,y)\sim\mathcal{D}_{\text{val}}}\big(y\in[\hat{f}(x;\mathcal{D}_{\text{train}})-\lambda_{\text{val}}\hat{\sigma}(x;\mathcal{D}_{\text{train}}),\ \hat{f}(x;\mathcal{D}_{\text{train}})+\lambda_{\text{val}}\hat{\sigma}(x;\mathcal{D}_{\text{train}})]\big)$.
    if $CR_{\text{val}}<1$ then
        $\lambda_l\leftarrow\lambda_{\text{val}}$.
    else
        $\lambda_u\leftarrow\lambda_{\text{val}}$.
    end if
end while
Algorithm 1 Pre-trained SP and UQ Combo Calibration
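
A runnable sketch of Algorithm 1 is given below for illustration. It reads the coverage event as the label $y$ falling inside the band $\hat{f}(x)\pm\lambda\hat{\sigma}(x)$. The function names, the cap on the number of doublings, and the choice to return the upper end of the final bracket (rather than the last midpoint) are our own assumptions and not part of the algorithm as stated.

```python
def coverage_rate(lam, mean_fn, std_fn, data):
    """Fraction of (x, y) pairs with y inside mean_fn(x) +/- lam * std_fn(x)."""
    hits = sum(1 for x, y in data if abs(y - mean_fn(x)) <= lam * std_fn(x))
    return hits / len(data)


def calibrate_multiplier(mean_fn, std_fn, val_data, eps=1e-6, max_doublings=60):
    """Find (approximately) the smallest lam with full coverage on val_data."""
    lam_l, lam_init, lam_u = 0.0, 1.0, float("inf")
    # Phase 1: double lam_init until the validation coverage reaches 1.
    for _ in range(max_doublings):
        if coverage_rate(lam_init, mean_fn, std_fn, val_data) < 1.0:
            lam_init *= 2.0
        else:
            lam_u = lam_init
            break
    if lam_u == float("inf"):
        # Guard added in this sketch: e.g. std_fn is zero where the residual is not.
        raise ValueError("no finite multiplier reaches full coverage")
    # Phase 2: bisect the bracket [lam_l, lam_u] down to width eps.
    while lam_u - lam_l > eps:
        lam_val = 0.5 * (lam_l + lam_u)
        if coverage_rate(lam_val, mean_fn, std_fn, val_data) < 1.0:
            lam_l = lam_val
        else:
            lam_u = lam_val
    return lam_u  # upper end of the bracket, so full coverage is retained


if __name__ == "__main__":
    # Toy surrogate: constant mean 0 and constant UQ 1 (purely illustrative).
    val = [(0, 0.5), (1, -2.5), (2, 1.0)]
    print(calibrate_multiplier(lambda x: 0.0, lambda x: 1.0, val))
```

Returning the upper end of the bracket guarantees that the reported multiplier still covers every validation point, at the cost of being at most $\epsilon$ larger than the minimal one.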

Additional sample runs on the objectives $f_1$ and $f_2$ are shown in Figs. 9 and 10.

Fig. 9: A sample run of GP, NN+MD, RP, KR+Hybrid (from left to right). GP has CCR 0.57 and width 1.47, NN + MD has CCR 0.77 and width 3.06, RP has CCR 0.91 and width 71.50, and KR + Hybrid has CCR 0.86 and width 3.26.
Fig. 10: A sample run of GP, NN+MD, RP, KR+Hybrid (from left to right). GP has CCR 0.86 and width 5.12, NN + MD has CCR 0.90 and width 3.63, RP has CCR 0.99 and width 126.53, and KR + Hybrid has CCR 0.93 and width 3.05.

9.3 Hyperparameter Tuning

This task originates from HPOBench [34]. The parameters to be tuned are shown in Table 4.

Hyperparameter Choices
Initial LR $\{0.0005, 0.001, 0.005, 0.01, 0.05, 0.1\}$
Batch Size $\{8, 16, 32, 64\}$
LR Schedule $\{$cosine, fixed$\}$
Activation of Layer 1 $\{$Tanh, ReLU$\}$
Width of Layer 1 $\{16, 32, 64, 128, 256, 512\}$
Dropout rate of Layer 1 $\{0.0, 0.3, 0.6\}$
Activation of Layer 2 $\{$Tanh, ReLU$\}$
Width of Layer 2 $\{16, 32, 64, 128, 256, 512\}$
Dropout rate of Layer 2 $\{0.0, 0.3, 0.6\}$
Table 4: Tunable hyperparameters and search space in the HPOBench task.

In this task, we follow the procedure in [65] to densify the search space by one-hot encoding the categorical hyperparameters (LR Schedule and the activation functions of layers $1$ and $2$), with one variable with domain $[0,1]$ representing each category. For each of the other hyperparameters, we use one variable with domain $[0,1]$, divided evenly among the candidates of that hyperparameter. Therefore, the search domain is $[0,1]^{12}$.
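
To illustrate how a point of $[0,1]^{12}$ could be mapped back to a configuration in Table 4, consider the sketch below. The ordering of the coordinates, the argmax rule for the one-hot blocks and the equal-width binning for the remaining variables are assumptions of this sketch; the exact decoding used in [65] may differ.

```python
# Candidate values from Table 4; the coordinate ordering is hypothetical.
CATEGORICAL = {
    "lr_schedule": ["cosine", "fixed"],
    "activation_1": ["Tanh", "ReLU"],
    "activation_2": ["Tanh", "ReLU"],
}
ORDINAL = {
    "initial_lr": [0.0005, 0.001, 0.005, 0.01, 0.05, 0.1],
    "batch_size": [8, 16, 32, 64],
    "width_1": [16, 32, 64, 128, 256, 512],
    "dropout_1": [0.0, 0.3, 0.6],
    "width_2": [16, 32, 64, 128, 256, 512],
    "dropout_2": [0.0, 0.3, 0.6],
}


def decode(x):
    """Map a point of [0,1]^12 to one configuration of Table 4."""
    if len(x) != 12:
        raise ValueError("expected a 12-dimensional point")
    config, pos = {}, 0
    # One coordinate per category: pick the category with the largest value.
    for name, choices in CATEGORICAL.items():
        block = list(x[pos:pos + len(choices)])
        config[name] = choices[block.index(max(block))]
        pos += len(choices)
    # One coordinate per hyperparameter: split [0,1] into equal-width bins.
    for name, choices in ORDINAL.items():
        k = min(int(x[pos] * len(choices)), len(choices) - 1)
        config[name] = choices[k]
        pos += 1
    return config


if __name__ == "__main__":
    point = [0.9, 0.1, 0.2, 0.8, 0.5, 0.4, 0.0, 1.0, 0.5, 0.34, 0.99, 0.0]
    print(decode(point))
```

The count of coordinates matches the text: three categorical hyperparameters with two categories each contribute six variables, and the six remaining hyperparameters contribute one variable each.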

9.4 Robot Arm Pushing

This 14D task is designed for optimizing the controllers of the robot's two arms to push two objects to their target places $o_1$ and $o_2$, starting from positions $s_1$ and $s_2$. Let us denote the ending positions of the two objects, resulting from a specific control, by $e_1$ and $e_2$. The final reward is defined as

$$R(x)=\lVert s_{1}-o_{1}\rVert+\lVert s_{2}-o_{2}\rVert-\left(\lVert s_{1}-e_{1}\rVert+\lVert s_{1}-e_{2}\rVert\right).$$

There are 14 parameters to control the location and rotation of the robot hands, pushing speed, moving direction and pushing time, presented in Table 5.

Hyperparameter of Arm 1/2  Range
Position $x$  [-5, 5]
Position $y$  [-5, 5]
Angle  [0, 2π]
Torque  [-5, 5]
Velocity $v_x$  [-10, 10]
Velocity $v_y$  [-10, 10]
Push duration  [2, 30]
Table 5: Tunable hyperparameters and search space in the robot arm pushing task.
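For concreteness, the box constraints in Table 5 can be assembled into 14-dimensional lower and upper bound vectors, as in the sketch below. The parameter ordering (all seven parameters of arm 1 followed by those of arm 2), the names `ARM_RANGES`, `lower`, `upper` and `to_physical`, and the rescaling from a unit cube are assumptions for illustration only; the reference implementations may order or scale the inputs differently.

```python
import numpy as np

# Per-arm ranges from Table 5; both arms are assumed to share the same ranges.
ARM_RANGES = [
    ("position_x", -5.0, 5.0),
    ("position_y", -5.0, 5.0),
    ("angle", 0.0, 2.0 * np.pi),
    ("torque", -5.0, 5.0),
    ("velocity_x", -10.0, 10.0),
    ("velocity_y", -10.0, 10.0),
    ("push_duration", 2.0, 30.0),
]

# Assumed ordering: the 7 parameters of arm 1, then the 7 parameters of arm 2.
lower = np.array([lo for _, lo, _ in ARM_RANGES] * 2)
upper = np.array([hi for _, _, hi in ARM_RANGES] * 2)


def to_physical(u):
    """Rescale a point u in [0,1]^14 to the ranges of Table 5 (hypothetical helper)."""
    u = np.asarray(u, dtype=float)
    assert u.shape == (14,)
    return lower + u * (upper - lower)
```

For example, `to_physical(np.full(14, 0.5))` returns the midpoint of every range, so both push durations equal 16.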

9.5 Rover Trajectory Planning

This task is a 60D problem in which the goal is to optimize the trajectory of the rover, determined by our choice of 30 points in a 2D plane. The reward is computed as

$$f(x)=c(x)+\lambda\left(\lVert x_{0,1}-s\rVert_{1}+\lVert x_{59,60}-o\rVert_{1}\right)+b,$$

where $s$ and $o$ are the starting position and the target position, $x\in[0,1]^{60}$ contains the chosen points, and $c(x)$ is a function that measures the cost of the trajectory determined by $x$.
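The structure of this reward can be sketched as follows. This is only an illustration of the formula: the function name `rover_reward`, the interleaved layout of the 30 points inside $x$, and the arguments `cost_fn`, `lam` and `offset` (standing in for $c$, $\lambda$ and $b$, whose actual definitions and values are given in the reference implementation and not here) are assumptions.

```python
import numpy as np


def rover_reward(x, start, target, cost_fn, lam, offset):
    """Evaluate f(x) = c(x) + lam * (||first - s||_1 + ||last - o||_1) + b.

    cost_fn, lam and offset are placeholders for c, lambda and b.
    """
    x = np.asarray(x, dtype=float)
    assert x.shape == (60,)
    # Assumed layout: consecutive pairs of coordinates form the 30 points.
    points = x.reshape(30, 2)
    start = np.asarray(start, dtype=float)
    target = np.asarray(target, dtype=float)
    # L1 distances of the first and last trajectory points to s and o.
    endpoint_term = np.abs(points[0] - start).sum() + np.abs(points[-1] - target).sum()
    return cost_fn(points) + lam * endpoint_term + offset
```

A caller would supply the trajectory cost as `cost_fn`, for instance a function that integrates a terrain cost along the curve through the 30 points.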

Instructions for running these two tasks can be found in https://github.com/uber-research/TuRBO and https://github.com/zi-w/Ensemble-Bayesian-Optimization.