GloptiNets: Scalable Non-Convex Optimization with Certificates
Abstract
We present a novel approach to non-convex optimization with certificates, which handles smooth functions on the hypercube or on the torus. Unlike traditional methods that rely on algebraic properties, our algorithm exploits the regularity of the target function, as reflected in the decay of its Fourier spectrum. By defining a tractable family of models, we can both obtain precise certificates and leverage the powerful computational techniques developed to optimize neural networks. The scalability of our approach is thus naturally enhanced by parallel computing on GPUs. When applied to polynomials of moderate dimension but with thousands of coefficients, our approach outperforms state-of-the-art optimization methods with certificates, such as those based on Lasserre’s hierarchy, and addresses problems that are intractable for these competitors.
1 Introduction
Non-convex optimization is a difficult and crucial task. In this paper, we aim at optimizing globally a non-convex function defined on the hypercube, by providing a certificate of optimality on the resulting solution. Let be a smooth function on . Here we provide an algorithm that given , an estimate of the minimizer of
produces an , that constitutes an explicit certificate for the quality of , of the form
with probability . The literature abounds with algorithms to optimize non-convex functions. Typically they are either (a) heuristics, very smart, but with no guarantees of global convergence Moscato et al. (1989); Horst and Pardalos (2013), (b) variations of algorithms used in convex optimization, which can guarantee convergence only to local minima Boyd and Vandenberghe (2004), or (c) algorithms with only asymptotic guarantees of convergence to a global minimum, but no explicit certificates Van Laarhoven et al. (1987). In general, the methods recalled above are quite fast at producing some solution, but do not provide guarantees on its quality: the produced point can be arbitrarily far from the optimum, so they are typically used where unreliable results can be accepted.
On the contrary, there are contexts where an explicit quantification of the incurred error is crucial for the task at hand (finance, engineering, scientific validation, safety-critical scenarios Lasserre (2009)). In these cases, more expensive methods that provide certificates are used, such as polynomial sum-of-squares (poly-SoS) Lasserre (2001, 2009). These kinds of techniques are quite powerful since they provide certificates in the form above, often with machine-precision error. However, (a) they have reduced applicability since must be a multivariate polynomial (possibly sparse, low-degree) and must be known in its analytical form (b) the resulting algorithm is a semi-definite programming optimization on matrices whose size grows very fast with the number of variables and the degree of the polynomial, becoming intractable already in moderate dimensions and degrees.
Our approach builds on and extends the more recent line of work on kernel sum-of-squares, and in particular the work of Woodworth et al. (2022) based on Fourier analysis. It mitigates the limitations of poly-SoS methods in both aspects: (a) we can deal with any function (not necessarily a polynomial) for which the Fourier transform is known and (b) the resulting algorithm leverages the smoothness properties of the objective function, as in Woodworth et al. (2022), rather than relying on its algebraic structure, leading to far more compact representations than poly-SoS. Contrary to Woodworth et al. (2022), we fully leverage the power of the certificate, allowing for a drastic reduction of the computational cost of the method. Indeed, we cast the minimization in terms of a much smaller problem, similar to the optimization of a small neural network that, albeit again non-convex, efficiently produces a good candidate on which we then compute the certificate.
Notably, our focus lies on a posteriori guarantees: we define a family of models that allow for efficient computation of certificates. Once the model structure is established, we have ample flexibility in training the model, offering various possibilities to achieve good certificates in practical scenarios, while still using well-established and effective techniques in the field of deep neural networks (DNN) Goodfellow et al. (2016) to reduce the computational burden of the approach.
Our contributions can be summarized as follows:
- We propose a new approach to global optimization with certificates which drastically extends the applicability domain allowed by the state of the art, since it can be applied to any function for which we can compute the Fourier transform (not just polynomials).
- The proposed approach is naturally tailored for GPU computations and provides a refined control of time and memory requirements of the proposed algorithm, contrary to poly-SoS methods (whose complexity scales dramatically and in a rigid way with dimension and degree of the polynomial).
- From a technical viewpoint, we improve the results in Woodworth et al. (2022), by developing a fast stochastic approach to recover the certificate in high probability (theorem 3), and we generalize the formulation of the problem to allow the use of powerful techniques from DNN, still providing a certificate on the result (section 3, in particular algorithm 1).
- In practical applications, we are able to provide certificates for functions in moderate dimensions, which surpasses the capabilities of current state-of-the-art techniques. Specifically, as shown in the experiments we can handle polynomials with thousands of coefficients. This achievement marks an important milestone towards utilizing these models to provide certificates for more challenging real-life problems.
1.1 Previous work
Polynomial SoS.
In the field of certificate-based polynomial optimization, Lasserre’s hierarchy plays a pivotal role Lasserre (2001, 2009). This hierarchy employs a sequence of SDP relaxations with increasing size proportional to (where is the dimension of the space and is a parameter that upper bounds the degree of the polynomial) and that ultimately converges to the optimal solution when . While Lasserre’s hierarchy is primarily associated with polynomial optimization, its applicability extends beyond this domain. It offers a specific formulation for the more general moment problem, enabling a wide range of applications; see Henrion et al. (2020) for an introduction. For polynomial optimization problems such as in eq. 1, a significant amount of research has been dedicated to leveraging problem structure to improve the scalability of the hierarchy. This research has predominantly focused on exploiting very specific sparsity patterns among the variables of the polynomial, enabling the handling in these restricted scenarios of instances ranging from a few variables to even thousands of variables Waki et al. (2006); Wang et al. (2021b, a). There have been theoretical results regarding optimization on the hypercube Bach and Rudi (2023); Laurent and Slot (2022), but there are no algorithms handling them natively. Furthermore, alternative approaches exist that exploit different types of structure, such as the constant trace property Mai et al. (2022).
Kernel SoS.
Kernel Sum of Squares (K-SoS) is an emerging research field that originated from the introduction of a novel parametrization for positive functions in Marteau-Ferey et al. (2020). This approach has found application in various domains, including Optimal Control Berthier et al. (2022), Optimal Transport Muzellec et al. (2021) and modeling probability distributions Rudi and Ciliberto (2021). In the context of function optimization, two types of theoretical results have been explored: a priori guarantees Rudi et al. (2020) and a posteriori guarantees Woodworth et al. (2022). A priori guarantees offer insights into the convergence rate towards a global optimum of the function, giving a rate on the number of parameters and the complexity necessary to optimize a function up to a given error. For example, Rudi et al. (2020) proposes a general technique to achieve the global optimum, with error of a function that is -times differentiable, by requiring a number of parameters essentially in the order of , which avoids the curse of dimensionality in the rate when the function is very regular, i.e., , while typical black-box optimization algorithms have a complexity that scales as . A posteriori guarantees focus on providing a certificate for the minimum found by the algorithm. In particular, Woodworth et al. (2022) provides both a priori guarantees and a posteriori certificates; however, the model considered makes it computationally infeasible to provide certificates in dimension larger than .
To conclude, approaches based on kernel-SoS make it possible to extend the applicability of global optimization methods with certificates to a wider family of functions, and to exploit finer regularity properties beyond just the number of variables and the degree of a polynomial. By comparison, we focus on making the optimization amenable to high-performance GPU computation while retaining an a posteriori certificate of optimality.
2 Computing certificates with extended k-SoS
Without loss of generality (see next remark), with the goal of simplifying the analysis and using powerful tools from harmonic analysis, we cast the problem in terms of minimization of a periodic function over the torus, (we will denote it also as ). In particular, we are interested in minimizing periodic functions for which we know (or we can easily compute) the coefficients of its Fourier representation, i.e.
(1) |
where is the set of integers. This setting is already interesting on its own, as it encompasses a large class of smooth functions. It includes notably trigonometric polynomials, i.e. functions which have only a finite number of non-zero Fourier coefficients . Optimization of trigonometric polynomials arises in multiple research areas, such as the optimal power flow Van Hentenryck (18) or quantum mechanics Hilling and Sudbery (2010). Note that this problem is already NP-hard, as it encompasses for instance the Max-Cut problem Waldspurger et al. (2013). Even so, we will consider the more general case where we can evaluate function values of , along with its Fourier coefficient , and we have access to its norm in a certain Hilbert space. This norm can be computed numerically for trigonometric polynomials, and more generally reflects the regularity (degree of differentiability) of the function, and thus the difficulty of the problem.
Remark 1 (No loss of generality in working on the torus).
Given a (non-periodic) function we can obtain a periodic function whose minimum is exactly and from which we can recover . Indeed, following the classical Chebychev construction, define as the componentwise application of to the elements of , i.e. and define as for . It is immediate to see that (a) is periodic, and, (b) since is invertible on and its image is exactly , we have where
We discuss an efficient representation of these problems in section 3.3.
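The reduction in Remark 1 can be checked numerically. The sketch below is only an illustration in one dimension: the test function `f` is a hypothetical choice (not from the paper), and the minima are approximated on a grid rather than certified. It compares the minimum of `f` on the interval with the minimum of its periodic lift through the cosine map on the torus.

```python
import math

def f(x):
    # Hypothetical non-periodic objective on [-1, 1] (illustrative only).
    return x**4 - 0.5 * x**2 + 0.1 * x

def g(y):
    # Periodic lift to the torus [0, 1): g(y) = f(cos(2*pi*y)).
    return f(math.cos(2.0 * math.pi * y))

n = 20001
# Grid approximation of the minimum of f on [-1, 1].
min_f = min(f(-1.0 + 2.0 * i / (n - 1)) for i in range(n))
# Grid approximation of the minimum of g on [0, 1).
min_g = min(g(i / n) for i in range(n))
print(min_f, min_g)
```

Both values agree up to the grid resolution, and `g(y + 1) == g(y)` up to rounding, which is the periodicity claimed in point (a) of the remark.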
2.1 Certificates for global optimization and k-SoS
A general “recipe” for obtaining certificates was developed in Woodworth et al. (2022) where, in particular, the following bound was derived (Woodworth et al., 2022, see Thm. 2)
(2) |
where is the norm of the Fourier coefficients of a periodic function , i.e.
(3) |
and the is taken over that is a class of non-negative functions. The paper Woodworth et al. (2022) then chooses to be the set of positive semidefinite models, leading to a possibly expensive convex SDP problem. Our approach instead starts from the following two observations: (a) the lower bound in eq. 2 holds for any set of non-negative functions, not necessarily convex; moreover, (b) any candidate solution of the supremum in eq. 2 would constitute a lower bound for , so there is no need to solve eq. 2 exactly. This yields the following theorem.
Theorem 1.
Given a point and a non-negative and periodic function , we have
(4) |
Proof.
Since is the minimizer of , then . Moreover, since and are feasible solutions for the r.h.s. of eq. 2, we have
from which we derive that . ∎
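To make the mechanism behind Theorem 1 concrete, here is a minimal one-dimensional sketch. It assumes that the norm in eq. 3 is the sum of the moduli of the Fourier coefficients, and that the certificate is the norm of the residual between the objective shifted by its value at the candidate and the non-negative model; both are assumptions of this example, as are the toy objective and the model coefficients. The underlying fact is elementary: a function is bounded below by minus the sum of the moduli of its Fourier coefficients.

```python
def sos_coeffs(h):
    """Fourier coefficients of g = |h|^2, where h is a trigonometric polynomial
    given as {frequency: coefficient}. g is non-negative by construction."""
    g = {}
    for k1, a in h.items():
        for k2, b in h.items():
            g[k1 - k2] = g.get(k1 - k2, 0.0) + complex(a) * complex(b).conjugate()
    return g

def l1_norm(u):
    """Sum of the moduli of the Fourier coefficients (assumed form of eq. 3)."""
    return sum(abs(v) for v in u.values())

# Toy objective f(x) = cos(2 pi x); candidate x_hat = 0.5 with f(x_hat) = -1.
f_hat = {-1: 0.5, 0: 0.0, 1: 0.5}
f_at_x_hat = -1.0

# Slightly imperfect non-negative model: g = |0.7 + 0.7 exp(2 pi i x)|^2.
g_hat = sos_coeffs({0: 0.7, 1: 0.7})

# Fourier coefficients of the residual f - f(x_hat) - g.
residual = dict(f_hat)
residual[0] -= f_at_x_hat
for w, v in g_hat.items():
    residual[w] = residual.get(w, 0.0) - v

gap = l1_norm(residual)          # bound on f(x_hat) - f_star
lower_bound = f_at_x_hat - gap   # certified lower bound on the minimum
print(gap, lower_bound)
```

Here the gap is about 0.04, so the minimum is certified to lie in the interval between -1.04 and -1; with coefficients equal to the inverse square root of 2 in the model, the residual vanishes and the certificate becomes exact.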
In particular, since any good candidate is enough to produce a certificate, we consider the following class of non-negative functions that can be seen as a two-layer neural network.
Definition 1 (extended K-SoS model on the torus).
Let be a periodic function in the first variable and let . Given a set of anchors and a matrix , we define the K-SoS model with
(5) |
The functions represented by the model above are non-negative and periodic. The model is an extension of the k-SoS model presented in Marteau-Ferey et al. (2020), where the points cannot be optimized. Moreover it has the following benefits at the expense of the convexity in the parameters:
1.
2. The extended model can have a reduced number of parameters, by choosing a matrix with or . This will drastically improve the cost of the optimization, while not impacting the approximation properties of the model, since a good approximation is still possible with already proportional to (Rudi et al., 2020, see Thm. 3).
3. The extended model does not require any positive semidefinite constraint on the matrix (contrary to the base model), which is typically a well-known bottleneck to scale up the optimization in the number of parameters Marteau-Ferey et al. (2020). In the extended model we trade the positive semidefinite constraint for non-convexity. However, this allows us to use all the advanced and effective techniques we know for unconstrained (or box-constrained) non-convex optimization of (two-layer) neural networks Goodfellow et al. (2016).
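The following sketch illustrates why such a model is non-negative and periodic without any semidefinite constraint. It assumes the model can be written as the squared Euclidean norm of a low-rank matrix applied to the vector of kernel evaluations at the anchors; the kernel, the anchors, the matrix `R` and the sizes `m` and `r` are all hypothetical choices made for this example.

```python
import math
import random

def kernel(x, z, s=2.0):
    # Assumed periodic kernel on the 1-D torus (illustrative choice).
    return math.exp(s * (math.cos(2.0 * math.pi * (x - z)) - 1.0))

def ksos(x, anchors, R):
    """Evaluate g(x) = || R k_z(x) ||^2, with k_z(x) = (K(x, z_1), ..., K(x, z_m))."""
    k = [kernel(x, z) for z in anchors]
    return sum(sum(r_ij * k_j for r_ij, k_j in zip(row, k)) ** 2 for row in R)

rng = random.Random(0)
m, r = 8, 2  # m anchors, rank r much smaller than m
anchors = [rng.random() for _ in range(m)]
R = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(r)]

# The model is a sum of squares, hence non-negative for any R and any anchors.
values = [ksos(i / 100, anchors, R) for i in range(100)]
print(min(values))
```

Both `R` and the anchors are free parameters here, which is what makes the model trainable with unconstrained first-order methods at the price of convexity.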
To conclude the picture on the k-SoS models, a critical aspect of the model is the choice of , since it must guarantee good approximation properties and at the same time we need to compute easily its Fourier coefficients since we need to evaluate . To this aim, a good candidate for are the reproducing kernels defined on the torus Steinwart and Christmann (2008). We use shift-invariant kernels, enabling a convenient analysis of the associated RKHS through their Fourier Transform.
Definition 2 (Reproducing kernel on the torus).
Let be a real function on , with positive Fourier Transform and . Let be the kernel defined with
(6) |
Then, is a r.k bounded by . We denote its Reproducing kernel Hilbert Space (RKHS) and by the associated RKHS norm
Define . We assume that we can compute (and sample from, see next section) , i.e., the Fourier transform of , corresponding to , for all .
By choosing such a , the models inherit the good approximation properties derived in Marteau-Ferey et al. (2020); Rudi and Ciliberto (2021). We conclude by recalling that shift-invariant reproducing kernels have a positive Fourier transform due to Bochner’s theorem Rudin (1990). The fact that is bounded by can be seen with . Finally, note that the Fourier coefficients of an extended k-SoS model can be computed exactly, as shown later in lemma 1.
2.2 Providing certificates with the -norm
As discussed in the previous section our approach for providing a certificate on relies on first obtaining using a fast algorithm without guarantees and solving approximately eq. 2 to obtain the certificate (see theorem 1). With this aim, now we need an efficient way to compute the norm . We use here a stochastic approach. Introducing a probability (that later will be chosen as a rescaled version of in definition 2) on we rewrite the -norm
(7) |
which yields an objective that is amenable to stochastic optimization. From there, Woodworth et al. (2022) computes a certificate by truncating the sum to a hypercube of size and bounding the remaining terms with a smoothness assumption on , which makes it possible to control the decay of . We want to avoid this cost, which is exponential in the dimension, so we proceed differently.
Probabilistic estimates with the norm.
Given that the -norm can be written as an expectation in eq. 7, we approximate it with an empirical mean given with i.i.d samples . Now, note that the variance of can be upper bounded by a Hilbert norm, as
(8) |
with the RKHS from definition 2 with kernel . This allows us to quantify the deviation of from , with e.g. Chebychev’s inequality, as shown in the next theorem.
Theorem 2 (Certificate with Chebychev Inequality).
Let be a probability distribution on , and a positive function. Let and be the empirical mean of obtained with samples . Then, a certificate with probability is given with
(9) |
Proof.
From its definition in eq. 7, we see that an unbiased estimator of the -norm is given by . Then, Chebychev’s inequality states that with probability at least . Using the computation of the variance in eq. 8, it follows that with probability at least . Plugging this expression into eq. 2, we obtain the result. ∎
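The argument of Theorem 2 can be sketched on a toy example. The code below assumes that the norm is the sum of the moduli of the Fourier coefficients, estimates it by importance sampling over frequencies, and adds the Chebychev deviation term. The coefficients `u_hat`, the uniform sampling distribution `lam`, and the values of `N` and `delta` are hypothetical; the second moment `var_bound` stands in for the Hilbert-norm bound on the variance of eq. 8.

```python
import math
import random

# Hypothetical Fourier coefficients of a 1-D function on frequencies -3..3.
u_hat = {w: 1.0 / (1 + w * w) for w in range(-3, 4)}
true_norm = sum(abs(v) for v in u_hat.values())

# Sampling distribution over frequencies (uniform here, for illustration).
freqs = sorted(u_hat)
lam = {w: 1.0 / len(freqs) for w in freqs}

rng = random.Random(0)
N, delta = 5000, 0.01
sample = rng.choices(freqs, weights=[lam[w] for w in freqs], k=N)

# Unbiased importance-sampling estimate of the norm.
estimate = sum(abs(u_hat[w]) / lam[w] for w in sample) / N

# Upper bound on the variance of one term: its second moment.
var_bound = sum(abs(u_hat[w]) ** 2 / lam[w] for w in freqs)

# Chebychev: the norm is below `upper` with probability at least 1 - delta.
upper = estimate + math.sqrt(var_bound / (N * delta))
print(true_norm, estimate, upper)
```

The deviation term decays as the inverse square root of `N * delta`, which is why a small `delta` is costly with this estimator.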
Note that the norm in can be developed with (assuming for conciseness that is comprised in the -frequency of )
(10) |
Thus, theorem 2 provides a certificate of as long as we can (i) evaluate the Fourier transform of and (ii) compute its Hilbert norm in some RKHS induced by . In the next section, we detail the choice we make to achieve this efficiently, with kernels amenable to GPU computation, scaling to thousands of coefficients.
Remark 2 (Using a RKHS norm instead of the -norm).
Note that since sums to , the associated kernel is bounded by . Hence , and the latter could be used instead of the -norm in eq. 2. There are two reasons for taking our approach instead. Firstly, the -norm is always tighter than an RKHS norm (see e.g. (Woodworth et al., 2022, Lem. 4)); secondly, we cannot compute efficiently and have to rely instead on another upper bound. However, taking the number of samples alleviates this issue.
Exponential concentration bounds with MoM.
The scaling in in theorem 2 can be prohibitive if one requires a high probability on the result (). Fortunately, alternative estimators exist for those cases. The Median-of-Means (MoM) estimator is an example, illustrated in theorem 3.
Theorem 3 (Certificate with MoM estimator).
Let be a probability distribution on , and . Draw frequencies . Define the MoM estimator with the following: for s.t. and , , write a partition of ; then
(11) |
A certificate on with probability follows, with
(12) |
Proof.
To conclude this section, bounding the norm from above with the -norm in eq. 3 makes it possible to obtain a certificate on , as shown in theorem 1. The -norm requires an infinite number of computations in the general case, but can be bounded efficiently with a probabilistic estimate, given by theorem 2 or theorem 3. This is summed up in fig. 1. Note that the difference is a source of conservatism in the certificate which we do not quantify; yet, the -norm is optimal for a class of norms, see (Woodworth et al., 2022, Lemma 3).
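For readers unfamiliar with the Median-of-Means estimator used in theorem 3, a minimal sketch follows. It uses contiguous blocks as the partition, which is one valid choice for i.i.d. samples, and toy data with a single outlier; the helper name `median_of_means` is introduced here for illustration.

```python
import statistics

def median_of_means(samples, k):
    """Split samples into k contiguous blocks (k <= len(samples)),
    average each block, and return the median of the block means."""
    n = len(samples)
    blocks = [samples[i * n // k:(i + 1) * n // k] for i in range(k)]
    return statistics.median(sum(b) / len(b) for b in blocks)

# Toy data: 99 well-behaved samples and one large outlier.
data = [1.0] * 99 + [1000.0]
mom = median_of_means(data, 10)
mean = sum(data) / len(data)
print(mom, mean)
```

The outlier contaminates a single block, so the median of the block means is unaffected while the plain empirical mean is pulled far from the typical value. This robustness is what yields deviation bounds with a logarithmic rather than polynomial dependence on the failure probability.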
3 Algorithm and implementation
3.1 Bessel kernel
We now detail the specific choice of kernel we make in order to compute the certificate of theorem 2 or theorem 3 efficiently. Our first observation is to use a kernel stable by product, so that we can easily characterize a Hilbert space the model belongs to. This restricts the choice to the exponential family. We therefore define, for a parameter ,
(14) |
with the modified Bessel function of the first kind (Watson, 1922, p.181). Then, define as in definition 2, and take a tensor product to extend the definition of to multiple dimensions, i.e. for any . We refer to this kernel as the Bessel kernel, and the associated RKHS as . It is stable by product as . This is key to computing the Fourier transform of the model , and is in contrast to previous approaches which used the exponential kernel with Woodworth et al. (2022); Bach and Rudi (2023).
In the following, is a K-SoS model defined as in definition 1, with the Bessel kernel of parameter defined in eq. 14.
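The two properties used in this section can be checked numerically in one dimension. The sketch assumes that the kernel profile is the exponential of a scaled cosine, normalized to equal one at the origin; this form, the parameter value and the sample points are assumptions of the example. It verifies (i) that the Fourier coefficients are scaled modified Bessel functions, which is the classical expansion of the exponential of a cosine, and (ii) that the product of two kernels with different anchors is again of the same exponential-cosine form.

```python
import math

def bessel_i(n, s, terms=40):
    """Modified Bessel function of the first kind I_n(s), via its power series."""
    return sum((s / 2.0) ** (2 * k + n) / (math.factorial(k) * math.factorial(k + n))
               for k in range(terms))

def q(x, s):
    # Assumed 1-D kernel profile: exp(s * (cos(2 pi x) - 1)).
    return math.exp(s * (math.cos(2.0 * math.pi * x) - 1.0))

def fourier_coeff(w, s, M=64):
    """w-th Fourier coefficient of q by the trapezoidal rule (q is even, so real)."""
    return sum(q(j / M, s) * math.cos(2.0 * math.pi * w * j / M) for j in range(M)) / M

s = 2.0
# (i) Coefficients match exp(-s) * I_|w|(s).
errors = [abs(fourier_coeff(w, s) - math.exp(-s) * bessel_i(abs(w), s))
          for w in range(-5, 6)]

# (ii) Product of two kernels: same family, new parameter and new centre.
x, z1, z2 = 0.3, 0.1, 0.45
lhs = q(x - z1, s) * q(x - z2, s)
s12 = 2.0 * s * math.cos(math.pi * (z1 - z2))
rhs = math.exp(s12 * math.cos(2.0 * math.pi * (x - 0.5 * (z1 + z2))) - 2.0 * s)
print(max(errors), abs(lhs - rhs))
```

Property (ii) follows from the sum-to-product identity for cosines, and it is what keeps the Fourier coefficients of products of kernels available in closed form.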
Lemma 1 (Fourier coefficient of the Bessel kernel).
For , the Fourier coefficient of in can be computed in time with
(15) |
Proof.
The second necessary ingredient for using the certificate of theorem 2 is computing an RKHS norm for . It relies on the inclusion of into the bigger space of symmetric operators .
Lemma 2 (Bound on the RKHS norm of ).
belongs to , and is bounded by the Hilbert-Schmidt norm of , which can be computed in time, with
(17) |
Proof.
Assume that ; the reasoning can be extended to multiple dimensions with the tensor product. From the computation of the Fourier coefficient in lemma 1 and the fact that , we have that hence . Finally, since the kernel is stable by product, , so we can use e.g. (Paulsen and Raghupathi, 2016, Thm. 5.16), with and , with the operator . ∎
3.2 The algorithm: GloptiNets
We can now describe how GloptiNets yields a certificate on . The key observation is that no matter how our model from definition 1 is obtained, we will always be able to compute a certificate with theorems 2 and 3. Thus, even though optimizing eq. 9 w.r.t is highly non-convex, we can use any optimization routine and check empirically its efficiency by looking at the certificate. Finally, thanks to its low-rank structure, evaluating the model is cheaper than evaluating its Fourier coefficients. This is formally shown in proposition 2 in appendix A, where a block-diagonal structure for the model is also introduced. Another detail of practical importance is that a loss based on function values can be efficiently backpropagated through; on the other hand, the certificate is not easily vectorized, and the Bessel function involved would require a specific approximation to be efficiently backpropagated through. That’s why we first optimize , where is a proxy for the norm, e.g. the log-sum-exp on a random batch of points:
(18) |
This optimization can be carried out with any deep learning library with automatic differentiation and any flavour of gradient-based optimization. Only afterwards do we compute the certificate with theorems 2 and 3. This procedure is summed up in algorithm 1.
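As an illustration of the proxy mentioned above, the sketch below shows a numerically stable log-sum-exp used as a smooth surrogate for the maximum of residual magnitudes over a batch. It is not the exact loss of eq. 18: the residual values and the sharpness parameter `beta` are hypothetical, and in practice the same computation would be written in a library with automatic differentiation.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

# Hypothetical residual magnitudes |f(x_i) - c - g(x_i)| on a batch of points.
residuals = [0.12, 0.03, 0.40, 0.25]
beta = 20.0  # sharpness: a larger beta tracks the maximum more closely

# Smooth surrogate of max(residuals); differentiable in every residual.
smooth_max = log_sum_exp([beta * r for r in residuals]) / beta
print(max(residuals), smooth_max)
```

The surrogate always lies between the true maximum and the maximum plus `log(n) / beta`, where `n` is the batch size, so it can be tightened at will while keeping informative gradients for all points of the batch.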
Remark 3 (Providing a candidate).
In algorithm 1, a candidate estimate for the minimum value is necessary. However, it is possible to overcome this requirement by incorporating as a learnable parameter within the training loop. Moreover, can be learned using techniques similar to those in Rudi et al. (2020): by replacing the lower bound with a parabola centered at , becomes a candidate for with precision corresponding to the tightness of the certificate. Note however that this method introduces additional hyperparameters.
3.3 Specific implementation for the Chebychev basis
As already observed in Bach and Rudi (2023), a result on trigonometric polynomials on directly extends to real polynomials on . The reason is that minimizing on amounts to minimizing the trigonometric polynomial on . Note however that is an even function in every dimension, as for any , . Thus, approximating with a K-SoS model of definition 1 is suboptimal, in the sense that we could approximate only on , which is smaller. Put differently, the Fourier coefficients of are real by design: it would be convenient to enforce this structure in the model . This is achieved with proposition 1.
Proposition 1 (Kernel defined on the Chebychev basis).
Let be a real, even function on the torus, bounded by , as in eq. 6. Let be the kernel defined on by
(19) |
Then is a symmetric, p.d., hence reproducing kernel, bounded by , with explicit feature map given by
(20) |
The proof is available in appendix B. It simply relies on expanding the definition of in eq. 19. The resulting expression in eq. 20 exhibits only cosine terms (in the decomposition of ). This makes it possible to directly extend the PSD models from definition 1 with such kernels. Finally, when used with the Bessel kernel of eq. 14, we recover an easy computation of the Chebychev coefficients, as shown in lemma 3, in time. This makes it possible to approximate any function expressed on the Chebychev basis. Note that polynomials expressed in other bases can be certified too, by first performing a change of basis.
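The reason only cosine terms appear can be illustrated with the defining identity of Chebychev polynomials: the polynomial of degree n evaluated at a cosine equals the cosine of n times the angle. The sketch below, with hypothetical coefficients, evaluates a one-dimensional Chebychev expansion through the three-term recurrence and compares it with the cosine series obtained after the change of variable.

```python
import math

def cheb_eval(coeffs, x):
    """Evaluate sum_n coeffs[n] * T_n(x) with the three-term recurrence."""
    t_prev, t_curr = 1.0, x
    total = coeffs[0]
    for c in coeffs[1:]:
        total += c * t_curr
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return total

def cosine_eval(coeffs, y):
    """Evaluate the same coefficients as a cosine series sum_n coeffs[n] cos(2 pi n y)."""
    return sum(c * math.cos(2.0 * math.pi * n * y) for n, c in enumerate(coeffs))

coeffs = [0.5, -1.0, 0.25, 0.75]  # hypothetical Chebychev coefficients
gaps = [abs(cheb_eval(coeffs, math.cos(2.0 * math.pi * y)) - cosine_eval(coeffs, y))
        for y in (i / 50 for i in range(50))]
print(max(gaps))
```

The Chebychev coefficients of the polynomial are therefore directly the cosine coefficients of the lifted periodic function, with no change of basis needed.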
4 Experiments
The code to reproduce these experiments is available at
Settings.
Given a function , we compute a candidate with gradient descent and multiple initializations. The goal is then to certify that is indeed a global minimizer of . This is a common setup in the Polynomial-SoS literature Wang and Magron (2022). To illustrate the influence of the number of parameters, the positive model defined in definition 1 for GloptiNets designates either a small model GN-small with 1792 parameters, or a bigger model GN-big with parameters. The latter should have higher expressivity and better interpolate positive functions, leading to tighter certificates. All results for GloptiNets are obtained with confidence . All other details regarding the experiments are reported in appendix C.
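The candidate-generation step described above can be sketched as follows, as a one-dimensional illustration rather than the experimental code. The objective `f`, the step size, the number of steps and the evenly spaced initializations are all hypothetical choices; the resulting `x_hat` comes with no guarantee, which is precisely what the certificate is meant to supply.

```python
import math

def f(x):
    # Hypothetical 1-D trigonometric polynomial (illustrative only).
    return math.cos(2.0 * math.pi * x) + 0.5 * math.cos(4.0 * math.pi * x + 1.0)

def grad_f(x):
    # Derivative of f with respect to x.
    return (-2.0 * math.pi * math.sin(2.0 * math.pi * x)
            - 2.0 * math.pi * math.sin(4.0 * math.pi * x + 1.0))

def gradient_descent(x0, lr=0.005, steps=1000):
    x = x0
    for _ in range(steps):
        x = (x - lr * grad_f(x)) % 1.0  # stay on the torus [0, 1)
    return x

# Multiple initializations: keep the best local minimizer found.
starts = [k / 16 for k in range(16)]
candidates = [gradient_descent(x0) for x0 in starts]
x_hat = min(candidates, key=f)
print(x_hat, f(x_hat))
```

This toy function has two local minima, so some initializations converge to the suboptimal one; taking the best over all runs recovers the global minimizer here, but only a certificate can confirm it in general.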
Polynomials.
We first consider the case where is a random trigonometric polynomial. Note that this is a restrictive analysis, as GloptiNets can handle any smooth function (i.e. with infinitely many non-zero Fourier coefficients). Polynomials have various dimensions , degrees , numbers of coefficients , but a constant RKHS norm . We compare the performance of GloptiNets to TSSOS, in its complex polynomial variant Wang and Magron (2022). The latter is used with parameters such that it executes the fastest, but without guarantees of convergence to the global minimum . Table 1 shows the certificates and the execution times (lower is better, in seconds) for TSSOS, GN-small and GN-big. Figure 2 provides certificates on a random polynomial, as a function of the number of parameters in .
Kernel mixtures.
While polynomials provide ground for comparison with existing work, GloptiNets is not confined to this function class. This is evidenced by experiments on kernel mixtures, where our approach stands as the only viable alternative we are aware of. The functions we certify are of the form , where is the Bessel kernel of eq. 14. Kernel mixtures are ubiquitous in machine learning and arise e.g. when performing kernel ridge regression. Certificates obtained on mixtures are compared with those obtained on polynomials in fig. 2, as a function of the model size .
TSSOS | GN-small | GN-big | ||||||
---|---|---|---|---|---|---|---|---|
Certif. | Certif. | Certif. | ||||||
out of memory! | - | |||||||
out of memory! | - |
Results.
There are two key insights about the performance of GloptiNets. Firstly, its certificate does not depend on the structure of the function to optimize. Thus, although GloptiNets does not match the performance of TSSOS on small polynomials, it can tackle polynomials which cannot be handled by competitors, with arbitrarily many coefficients (). For instance, TSSOS cannot handle problems with in table 1. More importantly, GloptiNets can certify a richer class of functions than polynomials, among which kernel mixtures. The performance of GloptiNets mostly depends on the complexity of the function to certify, as measured by its RKHS norm.
Secondly, note that a bigger model yields tighter certificates. This is detailed in fig. 2, where the same function is optimized with various models. The dependency of the certificate on the norm of is shown in fig. 3 in appendix C, along with experiments with Chebychev polynomials.
5 Limitations
One limitation of GloptiNets is the trade-off resulting from its high flexibility for obtaining a certificate as in algorithm 1. While this flexibility offers numerous advantages, it also introduces the need for an extensive hyperparameter search. Although we have identified a set of hyperparameters that align with deep learning practices – utilizing a Momentum optimizer with cosine decay and a large initial learning rate – the optimal settings may vary depending on the specific problem at hand.
In the same vein, the certificates given by GloptiNets are of moderate accuracy. While adding more parameters to the k-SoS model certainly helps (as shown in fig. 2), alternative optimization schemes to interpolate with might provide easier improvements. For instance, we found that using an approximate second-order scheme in algorithm 1 is key to obtaining good certificates.
In the specific setting of polynomial optimization, we highlight that our model is not competitive on problems which exhibit some algebraic structure, such as term sparsity or the constant trace property. Typically, problems with coefficients of low degree (less than or equal to ), which notably encompass the OPF problem, are very well handled by the family of solvers TSSOS belongs to. Finally, GloptiNets does not handle constraints yet.
6 Conclusion
The GloptiNets algorithm presented in this work lays the foundation for a new family of solvers which provide certificates for non-convex problems. While our approach does not aim to replace the well-established Lasserre’s hierarchy for sparse polynomials, it offers a fresh perspective on tackling a new set of problems at scale. Through demonstrations on synthetic examples, we have showcased the potential of our approach. Further research directions include extensive parameter tuning to obtain tighter certificates, with the possibility of leveraging second-order optimization schemes, along with warm-restart schemes for applications which require solving multiple similar problems sequentially.
Acknowledgments.
AR acknowledges support of the French government under management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). AR acknowledges support of the European Research Council (grant REAL 947908). JM was supported by the ERC grant number 101087696 (APHE-LAIA project) and by ANR 3IA MIAI@Grenoble Alpes (ANR-19-P3IA-0003).
References
- Bach and Rudi [2023] Francis Bach and Alessandro Rudi. Exponential convergence of sum-of-squares hierarchies for trigonometric polynomials. SIAM Journal on Optimization, 33(3):2137–2159, 2023.
- Berthier et al. [2022] Eloïse Berthier, Justin Carpentier, Alessandro Rudi, and Francis Bach. Infinite-Dimensional Sums-of-Squares for Optimal Control. In 2022 IEEE 61st Conference on Decision and Control (CDC), pages 577–582, December 2022. doi: 10.1109/CDC51059.2022.9992396.
- Boyd and Vandenberghe [2004] Stephen P Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
- Devroye et al. [2016] Luc Devroye, Matthieu Lerasle, Gabor Lugosi, and Roberto I. Oliveira. Sub-Gaussian Mean Estimators. The Annals of Statistics, 44(6):2695–2725, 2016.
- Dũng et al. [2017] Dinh Dũng, Vladimir N. Temlyakov, and Tino Ullrich. Hyperbolic Cross Approximation. arXiv:2211.04889, April 2017.
- Goodfellow et al. [2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
- Henrion et al. [2020] Didier Henrion, Milan Korda, and Jean-Bernard Lasserre. The Moment-SOS Hierarchy, volume 4 of Optimization and Its Applications. World Scientific Publishing Europe Ltd., December 2020. doi: 10.1142/q0252.
- Hilling and Sudbery [2010] Joseph J. Hilling and Anthony Sudbery. The geometric measure of multipartite entanglement and the singular values of a hypermatrix. Journal of Mathematical Physics, 51(7):072102, July 2010.
- Horst and Pardalos [2013] Reiner Horst and Panos M Pardalos. Handbook of global optimization, volume 2. Springer Science & Business Media, 2013.
- Josz and Molzahn [2018] Cédric Josz and Daniel K. Molzahn. Lasserre Hierarchy for Large Scale Polynomial Optimization in Real and Complex Variables. SIAM Journal on Optimization, 28(2):1017–1048, January 2018.
- Lasserre [2001] Jean B. Lasserre. Global Optimization with Polynomials and the Problem of Moments. SIAM Journal on Optimization, 11(3):796–817, January 2001. doi: 10.1137/S1052623400366802.
- Lasserre [2009] Jean Bernard Lasserre. Moments, Positive Polynomials and Their Applications, volume 1 of Series on Optimization and Its Applications. October 2009. doi: 10.1142/p665.
- Laurent and Slot [2022] Monique Laurent and Lucas Slot. An effective version of Schmüdgen’s Positivstellensatz for the hypercube. Optimization Letters, September 2022. doi: 10.1007/s11590-022-01922-5.
- Mai et al. [2022] Ngoc Hoang Anh Mai, J. B. Lasserre, Victor Magron, and Jie Wang. Exploiting Constant Trace Property in Large-scale Polynomial Optimization. ACM Transactions on Mathematical Software, 48(4):40:1–40:39, December 2022.
- Marteau-Ferey et al. [2020] Ulysse Marteau-Ferey, Francis Bach, and Alessandro Rudi. Non-parametric Models for Non-negative Functions. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 12816–12826. Curran Associates, Inc., 2020.
- Moscato et al. [1989] Pablo Moscato et al. On evolution, search, optimization, genetic algorithms and martial arts: Towards memetic algorithms. Caltech concurrent computation program, C3P Report, 826(1989):37, 1989.
- Muzellec et al. [2021] Boris Muzellec, Adrien Vacher, Francis Bach, François-Xavier Vialard, and Alessandro Rudi. Near-optimal estimation of smooth transport maps with kernel sums-of-squares. arXiv:2112.01907, December 2021.
- Paulsen and Raghupathi [2016] Vern I. Paulsen and Mrinal Raghupathi. An Introduction to the Theory of Reproducing Kernel Hilbert Spaces. Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, 2016. doi: 10.1017/CBO9781316219232.
- Rudi and Ciliberto [2021] Alessandro Rudi and Carlo Ciliberto. PSD Representations for Effective Probability Models. In Advances in Neural Information Processing Systems, volume 34, pages 19411–19422. Curran Associates, Inc., 2021.
- Rudi et al. [2020] Alessandro Rudi, Ulysse Marteau-Ferey, and Francis Bach. Finding Global Minima via Kernel Approximations. arXiv:2012.11978, December 2020.
- Rudin [1990] Walter Rudin. The Basic Theorems of Fourier Analysis. In Fourier Analysis on Groups, chapter 1, pages 1–34. John Wiley & Sons, Ltd, 1990. doi: 10.1002/9781118165621.ch1.
- Steinwart and Christmann [2008] Ingo Steinwart and Andreas Christmann. Support vector machines. Springer Science & Business Media, 2008.
- Van Hentenryck [18] Pascal Van Hentenryck. Machine Learning for Optimal Power Flows. INFORMS Tutorials in Operations Research, October 18.
- Van Laarhoven et al. [1987] Peter JM Van Laarhoven and Emile HL Aarts. Simulated annealing. Springer, 1987.
- Waki et al. [2006] Hayato Waki, Sunyoung Kim, Masakazu Kojima, and Masakazu Muramatsu. Sums of Squares and Semidefinite Program Relaxations for Polynomial Optimization Problems with Structured Sparsity. SIAM Journal on Optimization, 17(1):218–242, January 2006. doi: 10.1137/050623802.
- Waldspurger et al. [2013] Irène Waldspurger, Alexandre d’Aspremont, and Stéphane Mallat. Phase Recovery, MaxCut and Complex Semidefinite Programming, July 2013.
- Wang and Magron [2022] Jie Wang and Victor Magron. Exploiting Sparsity in Complex Polynomial Optimization. Journal of Optimization Theory and Applications, 192(1):335–359, January 2022.
- Wang et al. [2021a] Jie Wang, Victor Magron, and Jean-Bernard Lasserre. Chordal-TSSOS: A Moment-SOS Hierarchy That Exploits Term Sparsity with Chordal Extension. SIAM Journal on Optimization, 31(1):114–141, January 2021a. doi: 10.1137/20M1323564.
- Wang et al. [2021b] Jie Wang, Victor Magron, and Jean-Bernard Lasserre. TSSOS: A Moment-SOS Hierarchy That Exploits Term Sparsity. SIAM Journal on Optimization, 31(1):30–58, January 2021b. doi: 10.1137/19M1307871.
- Watson [1922] G. N. Watson. A Treatise on the Theory of Bessel Functions. Cambridge University Press, 1922.
- Woodworth et al. [2022] Blake Woodworth, Francis Bach, and Alessandro Rudi. Non-Convex Optimization with Certificates and Fast Rates Through Kernel Sums of Squares. In Proceedings of Thirty Fifth Conference on Learning Theory, pages 4620–4642. PMLR, June 2022.
Appendix A Extensions
We explore additional extensions of GloptiNets that further enhance its appeal. We first describe a block-diagonal structure for the model that allows faster evaluation, then a theoretical splitting scheme for optimization, a warm-start scheme, and finally the direct optimization of the certificate.
A.1 Block diagonal structure for efficient computation
Without any further assumption, we see that a model from definition 1 can be evaluated in time; its Fourier coefficient, given by lemma 1, in ; the bound on the RKHS norm is computed in time thanks to lemma 2; together, these make it possible to compute a certificate, as stated in theorem 2, in time, where is the number of frequencies sampled. If the function to be minimized has a large norm, we might need a large model size to have . Hence, we introduce a specific structure on which makes it block-diagonal and better conditioned.
Proposition 2 (Block-diagonal PSD model).
Let be a PSD model as in definition 1, with anchors. Split them into groups, denoting them , and . Compute the Cholesky factorization of each kernel matrix . Then, define as a block-diagonal matrix, with blocks defined as , , and . Equivalently,
(21) |
Then can be evaluated in time, in time, and in time. The model has real parameters.
Proof.
Having defined as such, it is PSD, of rank at most . Written , we can compute the Fourier coefficient by applying lemma 1 to each of the components. Adding the cost of computing results in a complexity of . Finally, note that where
Then, defining the matrix of blocks of size s.t. for , , we have
(22) |
and each term in the sum can be written , which is computed in time, plus to compute the Cholesky factor. ∎
Denoting , note that
(23) |
with an orthonormal basis of as . Thus, each model’s coefficient is defined on an orthonormal basis, which makes the optimization easier. Of course, this comes at an added complexity, which could be alleviated by using e.g. an incomplete Cholesky factorization instead.
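The computational gain from the block-diagonal structure can be illustrated with a minimal sketch. It assumes the model value is a quadratic form in a feature vector, with each diagonal block stored through a low-rank factor; the names `block_quadratic_form`, `dense_quadratic_form`, `factors` and `v` are hypothetical, and the random vector `v` only stands in for the (whitened) kernel features of proposition 2.

```python
import random

def block_quadratic_form(factors, v):
    """Return v^T A v for A = blockdiag(G_1 G_1^T, ..., G_b G_b^T).

    factors[j] is the factor G_j, stored as a list of rows (size x rank).
    v is split into consecutive chunks, one per block.
    Cost: O(sum_j size_j * rank_j), the dense matrix is never formed.
    """
    total, offset = 0.0, 0
    for G in factors:
        size, rank = len(G), len(G[0])
        chunk = v[offset:offset + size]
        # w = G^T chunk, so that chunk^T (G G^T) chunk = ||w||^2 >= 0
        w = [sum(G[i][k] * chunk[i] for i in range(size)) for k in range(rank)]
        total += sum(x * x for x in w)
        offset += size
    return total

def dense_quadratic_form(factors, v):
    """Reference implementation: assemble the full matrix, O(m^2) cost."""
    m = sum(len(G) for G in factors)
    A = [[0.0] * m for _ in range(m)]
    offset = 0
    for G in factors:
        size = len(G)
        for i in range(size):
            for j in range(size):
                A[offset + i][offset + j] = sum(a * b for a, b in zip(G[i], G[j]))
        offset += size
    return sum(v[i] * A[i][j] * v[j] for i in range(m) for j in range(m))

# Illustrative sizes only (assumption): 4 blocks of size 5 and rank 2.
rng = random.Random(0)
num_blocks, size, rank = 4, 5, 2
factors = [[[rng.gauss(0, 1) for _ in range(rank)] for _ in range(size)]
           for _ in range(num_blocks)]
v = [rng.gauss(0, 1) for _ in range(num_blocks * size)]
fast = block_quadratic_form(factors, v)
dense = dense_quadratic_form(factors, v)
```

Both routines return the same non-negative value; only the block version avoids the quadratic cost in the total number of anchors.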
Remark 4 (Relation to Term Sparsity in POP).
The successful application of polynomial hierarchies to problems with thousands of variables relies on giving the moment matrix a block structure Wang et al. [2021b, a]. If the monomial basis has size , the constraint is replaced with and . This makes it possible to solve SDPs of size at most instead of one of size . Our model in proposition 2 follows a similar route to lower the computational budget.
A.2 Global optimization with splitting scheme
While GloptiNets can provide certificates for functions, it falls behind local solvers in terms of competitiveness. The challenge lies in the fact that finding a certificate is considerably more difficult than finding a local minimum, as it necessitates the uniform approximation of the entire function. However, we present a novel algorithmic framework that has the potential to enhance the competitiveness of GloptiNets with local solvers while simultaneously delivering certificates. Our approach involves partitioning the search domain into multiple regions and computing lower bounds for each partition. By discarding portions of the domain where we can certify that the function exceeds a certain threshold, the algorithm progressively simplifies the optimization problem and removes areas from consideration. Moreover, such an approach is naturally well suited to parallel computation.
The algorithm relies on a divide-and-conquer mechanism. First, we split the hypercube into regions, where is the number of cores available. We compute an upper bound with a local solver. For each region, we run GloptiNets in parallel, computing a certificate at regular intervals. As soon as the certificate is bigger than the upper bound, we stop the process: we know that the global minimum is not in the associated region. We can then reallocate the freed computing power by splitting the biggest current region, which yields an easier problem. We stop as soon as the regions considered are small enough. This is summarized in algorithm 2, where indicates the loop run in parallel.
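The divide-and-conquer mechanism can be sketched in one dimension as follows. This is not algorithm 2 itself: the loop is sequential rather than parallel, and the certified lower bound of each region comes from a Lipschitz constant, which is only a stand-in (an assumption) for the certificate GloptiNets would compute. The names `branch_and_prune`, `lipschitz` and `tol` are hypothetical.

```python
import heapq

def branch_and_prune(f, lipschitz, lo=-1.0, hi=1.0, tol=1e-3):
    """Return (upper_bound, surviving_intervals) for min f on [lo, hi].

    A region is discarded as soon as its certified lower bound exceeds
    the best known upper bound; the largest region is always split first.
    """
    # Upper bound from a coarse grid (stand-in for a local solver).
    upper = min(f(lo + (hi - lo) * k / 10) for k in range(11))
    heap = [(-(hi - lo), lo, hi)]  # max-heap on the region width
    kept = []
    while heap:
        _, a, b = heapq.heappop(heap)
        mid = 0.5 * (a + b)
        f_mid = f(mid)
        upper = min(upper, f_mid)
        # Valid lower bound of f on [a, b] when f is `lipschitz`-Lipschitz.
        lower = f_mid - lipschitz * 0.5 * (b - a)
        if lower > upper:
            continue  # certified: the global minimum is not in [a, b]
        if b - a <= tol:
            kept.append((a, b))  # small enough, stop refining
            continue
        heapq.heappush(heap, (-(mid - a), a, mid))
        heapq.heappush(heap, (-(b - mid), mid, b))
    return upper, kept

# Toy objective with its minimum at x = 0.3; |f'| <= 2.6 on [-1, 1].
objective = lambda x: (x - 0.3) ** 2
best, regions = branch_and_prune(objective, lipschitz=2.6)
```

The surviving regions all lie near the minimizer, and the rest of the domain has been removed with a certificate.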
Note that minimizing on a hypercube of center and size amounts to minimizing on , which is another Chebychev polynomial whose coefficients can be evaluated efficiently thanks to the order-2 recurrence relation that every family of orthogonal polynomials satisfies. For Chebychev polynomials, that is .
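As a hedged illustration, the sketch below uses the standard three-term recurrence of Chebychev polynomials of the first kind, T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x), to evaluate a Chebychev series and its restriction to a sub-interval through an affine change of variable. It only evaluates the restricted function; it does not compute the coefficients of the rescaled polynomial. The names `chebyshev_values`, `chebyshev_series` and `restrict` are hypothetical.

```python
import math

def chebyshev_values(x, degree):
    """Return [T_0(x), ..., T_degree(x)] via T_{n+1} = 2x T_n - T_{n-1}."""
    values = [1.0, x][:degree + 1]
    for _ in range(2, degree + 1):
        values.append(2.0 * x * values[-1] - values[-2])
    return values

def chebyshev_series(coeffs, x):
    """Evaluate sum_n coeffs[n] * T_n(x)."""
    return sum(c * t for c, t in zip(coeffs, chebyshev_values(x, len(coeffs) - 1)))

def restrict(coeffs, center, half_width):
    """Return y -> f(center + half_width * y), with y ranging over [-1, 1]."""
    return lambda y: chebyshev_series(coeffs, center + half_width * y)

# Example series (arbitrary coefficients) restricted to [0.25, 0.75].
coeffs = [0.5, -1.0, 0.25, 0.125]
g = restrict(coeffs, center=0.5, half_width=0.25)
left_end, right_end = g(-1.0), g(1.0)
```

Minimizing `g` over [-1, 1] is then the same problem as minimizing the original series over the sub-interval.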
A.3 Warm restarts
Our model distinguishes itself by leveraging the analytical properties of the objective function, rather than relying solely on algebraic characteristics. This approach offers a notable advantage, as closely related functions can naturally benefit from a warm restart. For example, if we already have a certificate for a function using a PSD model , and we seek to compute a certificate for a similar function , we can readily employ GloptiNets by initializing the PSD model with . Indeed, if , we can expect , so we can expect the optimization to be faster.
In contrast, P-SoS methods, which rely on semidefinite programs (SDPs), cannot directly adapt to new problems without significant effort. For instance, if a new component is introduced, an entirely new SDP must be solved. Our model’s ability to accommodate related yet distinct problems could prove highly valuable in domains with a frequent need to certify different but closely related problems. In industry, the Optimal Power Flow (OPF) problem must be solved every 5 minutes Van Hentenryck [18]. With GloptiNets, once the initial challenging solve is performed, subsequent solves become easier, assuming minimal changes in supply and demand conditions.
A.4 Optimizing the certificate directly
As explained in section 3.2 where GloptiNets is introduced, we optimize a proxy of the norm rather than the certificate of theorems 2 and 3. This proxy is the log-sum-exp on a random batch of points. The reason for this is that evaluating an extended k-SoS model on requires time, while evaluating on requires time. Yet, optimizing the certificate directly would likely help obtain higher-precision certificates. Lemma 4 in appendix D sketches a method to reduce the computational cost of the Fourier computation from to .
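The log-sum-exp proxy can be sketched as a smooth surrogate of a maximum over a random batch. The sketch assumes the quantity being approximated is a maximum of residuals over the domain; the temperature, batch size, target and model below are illustrative choices, not the paper's settings, and the names `log_sum_exp_max`, `target` and `model` are hypothetical.

```python
import math
import random

def log_sum_exp_max(values, temperature=1.0):
    """Smooth upper bound on max(values): (1/t) * log(sum_i exp(t * v_i)).

    Satisfies max(values) <= result <= max(values) + log(n) / t.
    The maximum is subtracted before exponentiating for numerical stability.
    """
    m = max(values)
    total = sum(math.exp(temperature * (v - m)) for v in values)
    return m + math.log(total) / temperature

rng = random.Random(0)
target = lambda x: math.cos(2.0 * math.pi * x)
model = lambda x: 1.0 - 2.0 * (math.pi * x) ** 2  # hypothetical crude model
batch = [rng.uniform(-0.2, 0.2) for _ in range(64)]
residuals = [abs(target(x) - model(x)) for x in batch]
proxy = log_sum_exp_max(residuals, temperature=50.0)
```

Unlike the hard maximum, this proxy is differentiable in every residual, which is what makes it suitable for gradient-based training.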
Appendix B Kernel defined on the Chebychev basis
In this section we describe the approach we take to model functions written in the Chebychev basis. For such a polynomial, a naive approach would simply model as a trigonometric polynomial. However, note that the decomposition of only has cosine terms. Thus, approximating efficiently requires a PSD model which has only cosine terms in its Fourier decomposition. This is achieved by using a kernel written in the Chebychev basis, as introduced in proposition 1, for which we now provide a proof.
Proof of proposition 1.
We now use this kernel with the Bessel function , i.e. we define the kernel on to satisfy
(24) |
As it was the case for the torus, this kernel enables an easy characterization of a RKHS in which an associated PSD model lives.
Lemma 3 (Chebychev coefficient of the Bessel kernel).
Let be a PSD model as in definition 1, with the kernel of eq. 24. Then, the Chebychev coefficient of can be computed in time with
(25) |
where
Proof.
Expanding and definition of Chebychev coefficient.
From the definition of in eq. 5, we have
(26) |
We consider and . We denote s.t.
with the bijectivity of on . We now compute the Chebychev coefficient of . Denoted , this is
or equivalently
(27) |
Chebychev coefficient of kernel product.
With the definition of the kernel in proposition 1, eq. 19, we have
Now use the sum-to-product formula with the cosines to obtain
(28) |
We simplify this expression by introducing
(29) |
Then, eq. 28 becomes
(30) |
We recognize the definition of the kernel (which is not a surprise as we chose the kernel to be stable by product). However, we need variables in to retrieve the proper definition of the kernel. Instead, we use lemma 5 on eq. 30 combined with eq. 27, to obtain
which gives
(31) |
Equation 31 contains the Chebychev coefficient of the product of two kernel functions as defined in eq. 27. Plugging this result into the definition of in eq. 26, and noting that , we obtain the result. ∎
Thanks to lemma 3, we see that a model defined as in definition 1 with the Bessel kernel of eq. 24 has its Chebychev coefficients decaying in . Hence, it belongs to , the RKHS associated to .
Appendix C Additional details on the experiments
Tuning the hyperparameters.
The time reported in section 4 does not take into account the experiments needed to find a good set of hyperparameters. The parameters tuned were the type of optimizer, the decay of learning rate, and the regularization on the Frobenius norm of .
Regularization.
Regularization is performed by approximating the norm with a proxy which is faster to compute. We use instead of in eq. 22.
Hardware.
GloptiNets was run on NVIDIA V100 GPUs for the interpolation part, and on an Intel Xeon CPU E5-2698 v4 @ 2.20GHz for computing the certificate. TSSOS was run on an Apple M1 chip with the Mosek solver.
Configuration of TSSOS.
We use the lowest possible relaxation order (i.e. ), along with chordal sparsity. We use the first relaxation step of the hierarchy. In these settings, TSSOS is not guaranteed to converge to but executes the fastest.
Certificate vs. number of parameters for a given function.
In fig. 2, the target function is a random polynomial of norm or , or a kernel mixture with coefficients of norm or . The models forming the blue line are defined as in proposition 2, with rank, block size and number of blocks equal to respectively, with the block size we vary. The number of frequencies sampled to compute the certificate is , and accounts for the fact that the bound on the variance becomes larger than the MOM estimator for large models.
Certificate vs. problem difficulty for a given model.
We have 3 related parameters: the quality of the optimization (given by the certificate), the expressivity of the model (given by its number of parameters), and the difficulty of the optimization (given by the norm of the function). In fig. 3, we fix the latter and plot the relation between the first two. Here, we fix the model with parameters , and we optimize a polynomial in of degree , with RKHS norm ranging from to . The certificates obtained are given in fig. 3. The resulting plot exhibits a clear polynomial relation between the certificate and the norm of the function, with a slope of . This suggests that the certificate behaves as .
Comparison with TSSOS on the Fourier basis.
In table 1, the polynomials all have a RKHS norm of . The small model is defined as in proposition 2, with rank, block size and number of blocks equal to respectively. For the big models, those values are . The certificate is the maximum of the Chebychev bound of theorem 2 and the MoM bound of theorem 3. The number of frequencies sampled is .
Comparison with TSSOS on the Chebychev basis.
We compare GloptiNets with TSSOS on random Chebychev polynomials in table 2, similarly to the comparison with trigonometric polynomials in table 1. Minimizing polynomials defined on the canonical basis is easier: contrary to trigonometric polynomials, there is no need to account for the imaginary part of the variable. If is the dimension, complex polynomials are encoded in a variable of dimension in TSSOS, following the definition of Hermitian Sum-of-Squares introduced in Josz and Molzahn [2018]. Hence, the random polynomials we consider are characterized by the dimension and their number of coefficients ; instead of bounding the degree, we use all the basis elements for which . The maximum degree is then . The RKHS norm of is fixed to . As in the comparison on trigonometric polynomials in table 1, we see that GloptiNets provides similar certificates no matter the number of coefficients in . Even though it lags behind TSSOS for small polynomials, it handles large polynomials which are intractable for TSSOS. The “small” and “big” models have the same structure as in the trigonometric polynomial experiments.
TSSOS | GN-small | GN-big | ||||||
---|---|---|---|---|---|---|---|---|
Certif. | Certif. | Certif. | ||||||
Out of memory! | - |
Sampling from the Bessel distribution.
The function decays rapidly. In fact, with , which is the value used to generate the random polynomials, it falls under machine precision as soon as . Thus, we approximate the distribution with a discrete one with weights for s.t. the result is above the machine precision. We then extend it to multiple dimensions with a tensor product. Finally, we use a hash table to store the already sampled frequencies, which makes the evaluation of millions of frequencies much faster. For instance in dimension , sampling frequencies from the Bessel distribution of parameter on yields only unique frequencies. This allows for tighter certificates, as it makes the r.h.s of eq. 9, in , much smaller. Note that the time to generate this hash table is not reported in tables 1 and 2, and is of the order of a few seconds.
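The truncation and hash-table steps can be sketched as below. The sketch assumes one-dimensional weights proportional to the modified Bessel function of the first kind, I_|n|(s), truncated once they fall below machine precision relative to I_0(s); the exact weights used in the experiments may differ. A `Counter` plays the role of the hash table, and the names `bessel_i`, `truncated_bessel_weights` and `sample_frequencies`, as well as the parameter values, are hypothetical.

```python
import math
import random
from collections import Counter

def bessel_i(n, s, terms=60):
    """Modified Bessel function I_n(s), s > 0, from its power series."""
    n = abs(n)
    log_half = math.log(s / 2.0)
    return sum(
        math.exp((2 * k + n) * log_half - math.lgamma(k + 1) - math.lgamma(k + n + 1))
        for k in range(terms)
    )

def truncated_bessel_weights(s, eps=2.0 ** -52):
    """Normalized weights on {-N, ..., N}, dropping negligible frequencies."""
    weights = {0: bessel_i(0, s)}
    n = 1
    while True:
        w = bessel_i(n, s)
        if w < eps * weights[0]:  # below machine precision: truncate here
            break
        weights[n] = weights[-n] = w
        n += 1
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

def sample_frequencies(s, dim, count, rng):
    """Sample `count` frequencies in Z^dim from the tensor-product law.

    The Counter maps each unique frequency to its multiplicity, so a
    downstream evaluation only touches every unique frequency once.
    """
    weights = truncated_bessel_weights(s)
    support = list(weights)
    probs = [weights[k] for k in support]
    counts = Counter()
    for _ in range(count):
        counts[tuple(rng.choices(support, weights=probs, k=dim))] += 1
    return counts

weights_1d = truncated_bessel_weights(2.0)
counts = sample_frequencies(2.0, dim=3, count=10000, rng=random.Random(0))
```

Because the distribution is concentrated on low frequencies, the number of unique keys in `counts` is far smaller than the number of samples drawn.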
Optimizing a kernel mixture.
As is the case with polynomials, when optimizing a function of the form the certificate provided by GloptiNets only depends on the function norm and not on e.g. the number of coefficients . This is illustrated in fig. 4.
Appendix D Fourier coefficients in linear time
Lemma 4 (Fourier coefficient of the Bessel kernel in linear time).
Let be an extended k-SoS model as in definition 1. Then, its Fourier transform can be evaluated in linear time in with
(32) |
where
and is defined with lemma 6.
Lemma 4 provides a formula for computing which is linear in , but which still requires numerical approximation to compute the sum on . For instance, restricting the sum to the hyperbolic cross Dũng et al. [2017]
would result in a complexity of and should produce reasonably accurate estimates of for low .
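For illustration, the sketch below enumerates a hyperbolic cross under one common definition, namely the set of integer frequencies whose product of max(1, |ω_i|) over the coordinates is at most a given bound. This definition is an assumption and may differ from the one intended here in its normalization. The function name `hyperbolic_cross` is hypothetical.

```python
def hyperbolic_cross(dim, bound):
    """List all w in Z^dim with prod_i max(1, |w_i|) <= bound.

    Recursion on the dimension: once the first coordinate is fixed, the
    remaining coordinates must satisfy the same constraint with the
    budget divided by max(1, |w_1|), rounded down.
    """
    if dim == 0:
        return [()]
    points = []
    for w in range(-bound, bound + 1):
        remaining = bound // max(1, abs(w))
        for rest in hyperbolic_cross(dim - 1, remaining):
            points.append((w,) + rest)
    return points

cross_2d = hyperbolic_cross(2, 2)
full_grid_size = (2 * 2 + 1) ** 2
```

The set keeps large frequencies along the coordinate axes but drops frequencies that are large in several coordinates at once, which is why it grows much more slowly with the dimension than the full grid.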
Furthermore, since is real-even w.r.t , the inner-product in eq. 36 can be simplified by computing only half of the terms.
Proof.
From lemma 1, we have that
(33) |
Introducing
(34) |
eq. 33 simplifies to
(35) |
Using lemma 6, for any ,
( depends on ) so that, defined in eq. 34 now writes
(36) |
where, for any and , we defined
(37) |
We then define the embedding to be the tensor product of the . Then, eq. 36 allows us to write in eq. 35 as
which is the desired result. ∎
Appendix E Other computations
Lemma 5.
Let be the function defined on with
(38) |
Then, its Chebychev coefficients are given by
(39) |
Proof.
The . The component of a function on the Chebychev basis is given with
which we conveniently rewrite, with the classical change of variable ,
(40) |
which is valid for any interval of length .
Now, for , consider the function defined on with , or equivalently
(41) |
Putting eq. 41 into eq. 40, we obtain
The last term is odd, hence integrates to on an interval centered around . Hence,
(42) |
We recognize the definition of the modified Bessel function of the first kind, defined in eq. 14. Plugging this into eq. 42, we obtain
(43) |
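The link between Chebychev coefficients and modified Bessel functions can be checked numerically on a classical special case, which is not necessarily the exact function of lemma 5: the expansion exp(s x) = I_0(s) + 2 Σ_{n≥1} I_n(s) T_n(x) on [-1, 1], together with the integral representation I_n(s) = (1/π) ∫_0^π exp(s cos θ) cos(nθ) dθ. The function names and the values of `s` and `x` below are hypothetical choices.

```python
import math

def bessel_i_series(n, s, terms=40):
    """I_n(s), s > 0, from the power series sum_k (s/2)^(2k+n) / (k! (k+n)!)."""
    log_half = math.log(s / 2.0)
    return sum(
        math.exp((2 * k + n) * log_half - math.lgamma(k + 1) - math.lgamma(k + n + 1))
        for k in range(terms)
    )

def bessel_i_integral(n, s, nodes=2000):
    """I_n(s) = (1/pi) * int_0^pi exp(s cos t) cos(n t) dt, midpoint rule."""
    h = math.pi / nodes
    total = 0.0
    for k in range(nodes):
        theta = (k + 0.5) * h
        total += math.exp(s * math.cos(theta)) * math.cos(n * theta)
    return total * h / math.pi

def exp_from_chebyshev(s, x, degree=30):
    """Truncated expansion I_0(s) + 2 * sum_{n=1}^{degree} I_n(s) T_n(x)."""
    theta = math.acos(x)
    value = bessel_i_series(0, s)
    for n in range(1, degree + 1):
        value += 2.0 * bessel_i_series(n, s) * math.cos(n * theta)
    return value

s, x = 1.5, 0.3
gaps = [abs(bessel_i_series(n, s) - bessel_i_integral(n, s)) for n in range(6)]
expansion_gap = abs(exp_from_chebyshev(s, x) - math.exp(s * x))
```

Both `gaps` and `expansion_gap` are at the level of floating-point round-off, reflecting the fast decay of I_n(s) in n.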
Lemma 6 (Fourier decomposition of Bessel composed with cosine).
Let , and . Then,
(45) |
and by evenness of the coefficients.
Proof.
From the definition of the modified Bessel function of the first kind [Watson, 1922, p.77, Eq. 2], we have
so that
(46) |
Using the change of variable into eq. 46, we see that has the same parity as and
(47) |
Equation 47 can be rewritten
for which eq. 45 is a concise rewriting. ∎