
Structured Prediction in Online Learning

Authors: Pierre Boudart, Alessandro Rudi, Pierre Gaillard

Abstract: We study a theoretical and algorithmic framework for structured prediction in the online learning setting. The problem of structured prediction, i.e. estimating a function where the output space lacks a vectorial structure, is well studied in the literature of supervised statistical learning. We show that our algorithm is a generalisation of optimal algorithms from the supervised learning setting, and achieves the same excess risk upper bound also when data are not i.i.d. Moreover, we consider a second algorithm designed especially for non-stationary data distributions, including adversarial data. We bound its stochastic regret as a function of the variation of the data distributions.

Submitted 18 June, 2024; originally announced June 2024.

Comments: 29 pages

arXiv:2402.09796 [pdf, ps, other]

Closed-form Filtering for Non-linear Systems

Authors: Théophile Cantelobre, Carlo Ciliberto, Benjamin Guedj, Alessandro Rudi

Abstract: Sequential Bayesian Filtering aims to estimate the current state distribution of a Hidden Markov Model, given the past observations. The problem is well-known to be intractable for most application domains, except in notable cases such as the tabular setting or for linear dynamical systems with Gaussian noise. In this work, we propose a new class of filters based on Gaussian PSD Models, which offer several advantages in terms of density approximation and computational efficiency. We show that filtering can be efficiently performed in closed form when transitions and observations are Gaussian PSD Models. When the transition and observations are approximated by Gaussian PSD Models, we show that our proposed estimator enjoys strong theoretical guarantees, with estimation error that depends on the quality of the approximation and is adaptive to the regularity of the transition probabilities. In particular, we identify regimes in which our proposed filter attains a TV $ε$-error with memory and computational complexity of $O(ε^{-1})$ and $O(ε^{-3/2})$ respectively, including the offline learning step, in contrast to the $O(ε^{-2})$ complexity of sampling methods such as particle filtering.

Submitted 15 February, 2024; originally announced February 2024.

Comments: 38 pages

arXiv:2401.07734 [pdf, ps, other]

Solving moment and polynomial optimization problems on Sobolev spaces

Authors: Didier Henrion, Alessandro Rudi

Abstract: Using standard tools of harmonic analysis, we state and solve the problem of moments for positive measures supported on the unit ball of a Sobolev space of multivariate periodic trigonometric functions. We describe outer and inner semidefinite approximations of the cone of Sobolev moments. They are the basic components of an infinite-dimensional moment-sums of squares hierarchy, allowing to solve numerically non-convex polynomial optimization problems on infinite-dimensional Sobolev spaces, with global convergence guarantees.

Submitted 15 January, 2024; originally announced January 2024.

arXiv:2306.14932 [pdf, ps, other]

GloptiNets: Scalable Non-Convex Optimization with Certificates

Authors: Gaspard Beugnot, Julien Mairal, Alessandro Rudi

Abstract: We present a novel approach to non-convex optimization with certificates, which handles smooth functions on the hypercube or on the torus. Unlike traditional methods that rely on algebraic properties, our algorithm exploits the regularity of the target function intrinsic in the decay of its Fourier spectrum. By defining a tractable family of models, we allow at the same time to obtain precise certificates and to leverage the advanced and powerful computational techniques developed to optimize neural networks. In this way the scalability of our approach is naturally enhanced by parallel computing with GPUs. Our approach, when applied to the case of polynomials of moderate dimensions but with thousands of coefficients, outperforms the state-of-the-art optimization methods with certificates, such as the ones based on Lasserre's hierarchy, addressing problems intractable for the competitors.

Submitted 20 December, 2023; v1 submitted 26 June, 2023; originally announced June 2023.

Comments: Edit affiliations and acknowledgments

arXiv:2305.15557 [pdf, ps, other]

Non-Parametric Learning of Stochastic Differential Equations with Non-asymptotic Fast Rates of Convergence

Authors: Riccardo Bonalli, Alessandro Rudi

Abstract: We propose a novel non-parametric learning paradigm for the identification of drift and diffusion coefficients of multi-dimensional non-linear stochastic differential equations, which relies upon discrete-time observations of the state. The key idea essentially consists of fitting a RKHS-based approximation of the corresponding Fokker-Planck equation to such observations, yielding theoretical estimates of non-asymptotic learning rates which, unlike previous works, become increasingly tighter when the regularity of the unknown drift and diffusion coefficients becomes higher. Our method being kernel-based, offline pre-processing may be profitably leveraged to enable efficient numerical implementation, offering excellent balance between precision and computational complexity.

Submitted 23 April, 2024; v1 submitted 24 May, 2023; originally announced May 2023.

arXiv:2303.17109 [pdf, ps, other]

Efficient Sampling of Stochastic Differential Equations with Positive Semi-Definite Models

Authors: Anant Raj, Umut Şimşekli, Alessandro Rudi

Abstract: This paper deals with the problem of efficient sampling from a stochastic differential equation, given the drift function and the diffusion matrix. The proposed approach leverages a recent model for probabilities \cite{rudi2021psd} (the positive semi-definite -- PSD model) from which it is possible to obtain independent and identically distributed (i.i.d.) samples at precision $\varepsilon$ with a cost that is $m^2 d \log(1/\varepsilon)$ where $m$ is the dimension of the model, $d$ the dimension of the space. The proposed approach consists in: first, computing the PSD model that satisfies the Fokker-Planck equation (or its fractional variant) associated with the SDE, up to error $\varepsilon$, and then sampling from the resulting PSD model. Assuming some regularity of the Fokker-Planck solution (i.e. $β$-times differentiability plus some geometric condition on its zeros) we obtain an algorithm that: (a) in the preparatory phase obtains a PSD model with L2 distance $\varepsilon$ from the solution of the equation, with a model of dimension $m = \varepsilon^{-(d+1)/(β-2s)} (\log(1/\varepsilon))^{d+1}$ where $1/2\leq s\leq1$ is the fractional power to the Laplacian, and total computational complexity of $O(m^{3.5} \log(1/\varepsilon))$ and then (b) for Fokker-Planck equation, it is able to produce i.i.d. samples with error $\varepsilon$ in Wasserstein-1 distance, with a cost that is $O(d \varepsilon^{-2(d+1)/β-2} \log(1/\varepsilon)^{2d+3})$ per sample. This means that, if the probability associated with the SDE is somewhat regular, i.e. $β\geq 4d+2$, then the algorithm requires $O(\varepsilon^{-0.88} \log(1/\varepsilon)^{4.5d})$ in the preparatory phase, and $O(\varepsilon^{-1/2}\log(1/\varepsilon)^{2d+2})$ for each sample. Our results suggest that as the true solution gets smoother, we can circumvent the curse of dimensionality without requiring any sort of convexity.

Submitted 24 May, 2023; v1 submitted 29 March, 2023; originally announced March 2023.

arXiv:2301.06339 [pdf, other]

Approximation of optimization problems with constraints through kernel Sum-Of-Squares

Authors: Pierre-Cyril Aubin-Frankowski, Alessandro Rudi

Abstract: Handling an infinite number of inequality constraints in infinite-dimensional spaces occurs in many fields, from global optimization to optimal transport. These problems have been tackled individually in several previous articles through kernel Sum-Of-Squares (kSoS) approximations. We propose here a unified theorem to prove convergence guarantees for these schemes. Pointwise inequalities are turned into equalities within a class of nonnegative kSoS functions. Assuming further that the functions appearing in the problem are smooth, focusing on pointwise equality constraints enables the use of scattering inequalities to mitigate the curse of dimensionality in sampling the constraints. Our approach is illustrated in learning vector fields with side information, here the invariance of a set.

Submitted 21 February, 2024; v1 submitted 16 January, 2023; originally announced January 2023.

MSC Class: 46E22; 46N10; 90C26

arXiv:2301.04350 [pdf, ps, other]

Maximum Centre-Disjoint Mergeable Disks

Authors: Ali Gholami Rudi

Abstract: Given a set of disks on the plane, the goal of the problem studied in this paper is to choose a subset of these disks such that none of its members contains the centre of any other. Each disk not in this subset must be merged with one of its nearby disks, that is, increasing the latter's radius. We prove that this problem is NP-hard. We also present polynomial-time algorithms for the special case in which the centres of all disks are on a line.

Submitted 11 January, 2023; originally announced January 2023.

MSC Class: 68U05; ACM Class: F.2.2

arXiv:2211.08958 [pdf, other]

Vector-Valued Least-Squares Regression under Output Regularity Assumptions

Authors: Luc Brogat-Motte, Alessandro Rudi, Céline Brouard, Juho Rousu, Florence d'Alché-Buc

Abstract: We propose and analyse a reduced-rank method for solving least-squares regression problems with infinite dimensional output. We derive learning bounds for our method, and study under which setting statistical performance is improved in comparison to the full-rank method. Our analysis extends the interest of reduced-rank regression beyond the standard low-rank setting to more general output regularity assumptions. We illustrate our theoretical insights on synthetic least-squares problems. Then, we propose a surrogate structured prediction method derived from this reduced-rank method. We assess its benefits on three different problems: image reconstruction, multi-label classification, and metabolite identification.

Submitted 16 November, 2022; originally announced November 2022.

arXiv:2211.04889 [pdf, other]

Exponential convergence of sum-of-squares hierarchies for trigonometric polynomials

Authors: Francis Bach, Alessandro Rudi

Abstract: We consider the unconstrained optimization of multivariate trigonometric polynomials by the sum-of-squares hierarchy of lower bounds. We first show a convergence rate of $O(1/s^2)$ for the relaxation with degree $s$ without any assumption on the trigonometric polynomial to minimize. Second, when the polynomial has a finite number of global minimizers with invertible Hessians at these minimizers, we show an exponential convergence rate with explicit constants. Our results also apply to minimizing regular multivariate polynomials on the hypercube.

Submitted 18 April, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

Journal ref: SIAM Journal on Optimization, In press

arXiv:2205.13255 [pdf, other]

Active Labeling: Streaming Stochastic Gradients

Authors: Vivien Cabannes, Francis Bach, Vianney Perchet, Alessandro Rudi

Abstract: The workhorse of machine learning is stochastic gradient descent. To access stochastic gradients, it is common to consider iteratively input/output pairs of a training dataset. Interestingly, it appears that one does not need full supervision to access stochastic gradients, which is the main motivation of this paper. After formalizing the "active labeling" problem, which focuses on active learning with partial supervision, we provide a streaming technique that provably minimizes the ratio of generalization error over the number of samples. We illustrate our technique in depth for robust regression.

Submitted 7 December, 2022; v1 submitted 26 May, 2022; originally announced May 2022.

Comments: 38 pages (9 main pages), 9 figures

MSC Class: 68T37; ACM Class: G.3

arXiv:2204.04970 [pdf, other]

Non-Convex Optimization with Certificates and Fast Rates Through Kernel Sums of Squares

Authors: Blake Woodworth, Francis Bach, Alessandro Rudi

Abstract: We consider potentially non-convex optimization problems, for which optimal rates of approximation depend on the dimension of the parameter space and the smoothness of the function to be optimized. In this paper, we propose an algorithm that achieves close to optimal a priori computational guarantees, while also providing a posteriori certificates of optimality. Our general formulation builds on infinite-dimensional sums-of-squares and Fourier analysis, and is instantiated on the minimization of multivariate periodic functions.

Submitted 11 April, 2022; originally announced April 2022.

arXiv:2202.13733 [pdf, other]

On the Benefits of Large Learning Rates for Kernel Methods

Authors: Gaspard Beugnot, Julien Mairal, Alessandro Rudi

Abstract: This paper studies an intriguing phenomenon related to the good generalization performance of estimators obtained by using large learning rates within gradient descent algorithms. First observed in the deep learning literature, we show that this phenomenon can be precisely characterized in the context of kernel methods, even though the resulting optimization problem is convex. Specifically, we consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution on the Hessian's eigenvectors. This extends an intuition described by Nakkiran (2020) on a two-dimensional toy problem to realistic learning scenarios such as kernel ridge regression. While large learning rates may be proven beneficial as soon as there is a mismatch between the train and test objectives, we further explain why it already occurs in classification tasks without assuming any particular mismatch between train and test data distributions.

Submitted 3 June, 2022; v1 submitted 28 February, 2022; originally announced February 2022.

Comments: Accepted paper at Conference COLT 2022. To be published to Proceedings of Machine Learning Research (PMLR)

arXiv:2202.13729 [pdf, other]

Second order conditions to decompose smooth functions as sums of squares

Authors: Ulysse Marteau-Ferey, Francis Bach, Alessandro Rudi

Abstract: We consider the problem of decomposing a regular non-negative function as a sum of squares of functions which preserve some form of regularity. In the same way as decomposing non-negative polynomials as sum of squares of polynomials allows to derive methods in order to solve global optimization problems on polynomials, decomposing a regular function as a sum of squares allows to derive methods to solve global optimization problems on more general functions. As the regularity of the functions in the sum of squares decomposition is a key indicator in analyzing the convergence and speed of convergence of optimization methods, it is important to have theoretical results guaranteeing such a regularity. In this work, we show second order sufficient conditions in order for a $p$ times continuously differentiable non-negative function to be a sum of squares of $p-2$ differentiable functions. The main hypothesis is that, locally, the function grows quadratically in directions which are orthogonal to its set of zeros. The novelty of this result, compared to previous works is that it allows sets of zeros which are continuous as opposed to discrete, and also applies to manifolds as opposed to open sets of $\mathbb{R}^d$. This has applications in problems where manifolds of minimizers or zeros typically appear, such as in optimal transport, and for minimizing functions defined on manifolds.

Submitted 28 February, 2022; originally announced February 2022.

arXiv:2202.05614 [pdf, other]

Measuring dissimilarity with diffeomorphism invariance

Authors: Théophile Cantelobre, Carlo Ciliberto, Benjamin Guedj, Alessandro Rudi

Abstract: Measures of similarity (or dissimilarity) are a key ingredient to many machine learning algorithms. We introduce DID, a pairwise dissimilarity measure applicable to a wide range of data spaces, which leverages the data's internal structure to be invariant to diffeomorphisms. We prove that DID enjoys properties which make it relevant for theoretical study and practical use. By representing each datum as a function, DID is defined as the solution to an optimization problem in a Reproducing Kernel Hilbert Space and can be expressed in closed-form. In practice, it can be efficiently approximated via Nyström sampling. Empirical experiments support the merits of DID.

Submitted 7 March, 2022; v1 submitted 11 February, 2022; originally announced February 2022.

Comments: A pre-print

arXiv:2201.13055 [pdf]

Nyström Kernel Mean Embeddings

Authors: Antoine Chatalic, Nicolas Schreuder, Alessandro Rudi, Lorenzo Rosasco

Abstract: Kernel mean embeddings are a powerful tool to represent probability distributions over arbitrary spaces as single points in a Hilbert space. Yet, the cost of computing and storing such embeddings prohibits their direct use in large-scale settings. We propose an efficient approximation procedure based on the Nyström method, which exploits a small random subset of the dataset. Our main result is an upper bound on the approximation error of this procedure. It yields sufficient conditions on the subsample size to obtain the standard $n^{-1/2}$ rate while reducing computational costs. We discuss applications of this result for the approximation of the maximum mean discrepancy and quadrature rules, and illustrate our theoretical findings with numerical experiments.

Submitted 15 June, 2022; v1 submitted 31 January, 2022; originally announced January 2022.

Comments: 8 pages

Journal ref: ICML 2022

arXiv:2112.01907 [pdf, other]

Near-optimal estimation of smooth transport maps with kernel sums-of-squares

Authors: Boris Muzellec, Adrien Vacher, Francis Bach, François-Xavier Vialard, Alessandro Rudi

Abstract: It was recently shown that under smoothness conditions, the squared Wasserstein distance between two distributions could be efficiently computed with appealing statistical error upper bounds. However, rather than the distance itself, the object of interest for applications such as generative modeling is the underlying optimal transport map. Hence, computational and statistical guarantees need to be obtained for the estimated maps themselves. In this paper, we propose the first tractable algorithm for which the statistical $L^2$ error on the maps nearly matches the existing minimax lower-bounds for smooth map estimation. Our method is based on solving the semi-dual formulation of optimal transport with an infinite-dimensional sum-of-squares reformulation, and leads to an algorithm which has dimension-free polynomial rates in the number of samples, with potentially exponentially dimension-dependent constants.

Submitted 29 December, 2021; v1 submitted 3 December, 2021; originally announced December 2021.

arXiv:2111.11306 [pdf, other]

Learning PSD-valued functions using kernel sums-of-squares

Authors: Boris Muzellec, Francis Bach, Alessandro Rudi

Abstract: Shape constraints such as positive semi-definiteness (PSD) for matrices or convexity for functions play a central role in many applications in machine learning and sciences, including metric learning, optimal transport, and economics. Yet, very few function models exist that enforce PSD-ness or convexity with good empirical performance and theoretical guarantees. In this paper, we introduce a kernel sum-of-squares model for functions that take values in the PSD cone, which extends kernel sums-of-squares models that were recently proposed to encode non-negative scalar functions. We provide a representer theorem for this class of PSD functions, show that it constitutes a universal approximator of PSD functions, and derive eigenvalue bounds in the case of subsampled equality constraints. We then apply our results to modeling convex functions, by enforcing a kernel sum-of-squares representation of their Hessian, and show that any smooth and strongly convex function may be thus represented. Finally, we illustrate our methods on a PSD matrix-valued regression task, and on scalar-valued convex regression.

Submitted 24 January, 2022; v1 submitted 22 November, 2021; originally announced November 2021.

arXiv:2110.10527 [pdf, other]

Sampling from Arbitrary Functions via PSD Models

Authors: Ulysse Marteau-Ferey, Francis Bach, Alessandro Rudi

Abstract: In many areas of applied statistics and machine learning, generating an arbitrary number of independent and identically distributed (i.i.d.) samples from a given distribution is a key task. When the distribution is known only through evaluations of the density, current methods either scale badly with the dimension or require very involved implementations. Instead, we take a two-step approach by first modeling the probability distribution and then sampling from that model. We use the recently introduced class of positive semi-definite (PSD) models, which have been shown to be efficient for approximating probability densities. We show that these models can approximate a large class of densities concisely using few evaluations, and present a simple algorithm to effectively sample from these models. We also present preliminary empirical results to illustrate our assertions.

Submitted 28 October, 2021; v1 submitted 20 October, 2021; originally announced October 2021.

arXiv:2110.07396 [pdf, other]

Infinite-Dimensional Sums-of-Squares for Optimal Control

Authors: Eloïse Berthier, Justin Carpentier, Alessandro Rudi, Francis Bach

Abstract: We introduce an approximation method to solve an optimal control problem via the Lagrange dual of its weak formulation. It is based on a sum-of-squares representation of the Hamiltonian, and extends a previous method from polynomial optimization to the generic case of smooth problems. Such a representation is infinite-dimensional and relies on a particular space of functions, a reproducing kernel Hilbert space, chosen to fit the structure of the control problem. After subsampling, it leads to a practical method that amounts to solving a semi-definite program. We illustrate our approach by a numerical application on a simple low-dimensional control problem.

Submitted 14 October, 2021; originally announced October 2021.

arXiv:2110.03960 [pdf, other]

Mixability made efficient: Fast online multiclass logistic regression

Authors: Rémi Jézéquel, Pierre Gaillard, Alessandro Rudi

Abstract: Mixability has been shown to be a powerful tool to obtain algorithms with optimal regret. However, the resulting methods often suffer from high computational complexity which has reduced their practical applicability. For example, in the case of multiclass logistic regression, the aggregating forecaster (Foster et al. (2018)) achieves a regret of $O(\log(Bn))$ whereas Online Newton Step achieves $O(e^B\log(n))$ obtaining a double exponential gain in $B$ (a bound on the norm of comparative functions). However, this high statistical performance is at the price of a prohibitive computational complexity $O(n^{37})$.

Submitted 8 October, 2021; originally announced October 2021.

arXiv:2106.16116 [pdf, ps, other]

PSD Representations for Effective Probability Models

Authors: Alessandro Rudi, Carlo Ciliberto

Abstract: Finding a good way to model probability densities is key to probabilistic inference. An ideal model should be able to concisely approximate any probability while also being compatible with two main operations: multiplication of two models (product rule) and marginalization with respect to a subset of the random variables (sum rule). In this work, we show that a recently proposed class of positive semi-definite (PSD) models for non-negative functions is particularly suited to this end. In particular, we characterize both approximation and generalization capabilities of PSD models, showing that they enjoy strong theoretical guarantees. Moreover, we show that we can efficiently perform both the sum and product rules in closed form via matrix operations, enjoying the same versatility as mixture models. Our results open the way to applications of PSD models to density estimation, decision theory and inference.

Submitted 24 November, 2021; v1 submitted 30 June, 2021; originally announced June 2021.

Comments: 50 pages, 1 table

arXiv:2106.09994 [pdf, other]

A Note on Optimizing Distributions using Kernel Mean Embeddings

Authors: Boris Muzellec, Francis Bach, Alessandro Rudi

Abstract: Kernel mean embeddings are a popular tool that consists in representing probability measures by their infinite-dimensional mean embeddings in a reproducing kernel Hilbert space. When the kernel is characteristic, mean embeddings can be used to define a distance between probability measures, known as the maximum mean discrepancy (MMD). A well-known advantage of mean embeddings and MMD is their low computational cost and low sample complexity. However, kernel mean embeddings have had limited applications to problems that consist in optimizing distributions, due to the difficulty of characterizing which Hilbert space vectors correspond to a probability distribution. In this note, we propose to leverage the kernel sums-of-squares parameterization of positive functions of Marteau-Ferey et al. [2020] to fit distributions in the MMD geometry. First, we show that when the kernel is characteristic, distributions with a kernel sum-of-squares density are dense. Then, we provide algorithms to optimize such distributions in the finite-sample setting, which we illustrate in a density fitting numerical experiment.

Submitted 27 June, 2021; v1 submitted 18 June, 2021; originally announced June 2021.
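
As a rough illustration of the maximum mean discrepancy mentioned in the abstract above, the following sketch computes the standard biased (V-statistic) estimate of MMD² between two finite samples. It is not the authors' algorithm: the Gaussian kernel, the `bandwidth` parameter, and all function names are assumptions chosen for the example.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth=1.0):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of MMD^2 between the empirical distributions of xs and ys.

    Uses the expansion ||mu_P - mu_Q||^2 = E k(x, x') + E k(y, y') - 2 E k(x, y),
    with each expectation replaced by an average over the samples.
    """
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2.0 * mean_kernel(xs, ys, bandwidth)
    )


if __name__ == "__main__":
    sample_p = [(0.0,), (1.0,)]
    sample_q = [(3.0,), (4.0,)]
    print(mmd_squared(sample_p, sample_p))  # identical samples
    print(mmd_squared(sample_p, sample_q))  # well-separated samples
```

The estimate is zero when both samples coincide and grows as they separate; the Gaussian kernel used here is a common example of a characteristic kernel.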

arXiv:2106.08855 [pdf, other]

Beyond Tikhonov: Faster Learning with Self-Concordant Losses via Iterative Regularization

Authors: Gaspard Beugnot, Julien Mairal, Alessandro Rudi

Abstract: The theory of spectral filtering is a remarkable tool to understand the statistical properties of learning with kernels. For least squares, it allows one to derive various regularization schemes that yield faster convergence rates of the excess risk than with Tikhonov regularization. This is typically achieved by leveraging classical assumptions called source and capacity conditions, which characterize the difficulty of the learning task. In order to understand estimators derived from other loss functions, Marteau-Ferey et al. have extended the theory of Tikhonov regularization to generalized self-concordant (GSC) loss functions, which contain, e.g., the logistic loss. In this paper, we go a step further and show that fast and optimal rates can be achieved for GSC by using the iterated Tikhonov regularization scheme, which is intrinsically related to the proximal point method in optimization, and overcomes the limitation of the classical Tikhonov regularization.

Submitted 10 November, 2021; v1 submitted 16 June, 2021; originally announced June 2021.

Comments: To be published in NeurIPS 2021

arXiv:2105.15069 [pdf, other]

On the Consistency of Max-Margin Losses

Authors: Alex Nowak-Vila, Alessandro Rudi, Francis Bach

Abstract: The foundational concept of Max-Margin in machine learning is ill-posed for output spaces with more than two labels such as in structured prediction. In this paper, we show that the Max-Margin loss can only be consistent to the classification task under highly restrictive assumptions on the discrete loss measuring the error between outputs. These conditions are satisfied by distances defined in tree graphs, for which we prove consistency, thus being the first losses shown to be consistent for Max-Margin beyond the binary setting. We finally address these limitations by correcting the concept of Max-Margin and introducing the Restricted-Max-Margin, where the maximization of the loss-augmented scores is maintained, but performed over a subset of the original domain. The resulting loss is also a generalization of the binary support vector machine and it is consistent under milder conditions on the discrete loss.

Submitted 21 March, 2022; v1 submitted 31 May, 2021; originally announced May 2021.

arXiv:2102.03594 [pdf, other]

Online nonparametric regression with Sobolev kernels

Authors: Oleksandr Zadorozhnyi, Pierre Gaillard, Sebastien Gerschinovitz, Alessandro Rudi

Abstract: In this work we investigate the variation of the online kernelized ridge regression algorithm in the setting of $d$-dimensional adversarial nonparametric regression. We derive the regret upper bounds on the classes of Sobolev spaces $W_{p}^β(\mathcal{X})$, $p\geq 2, β>\frac{d}{p}$. The upper bounds are supported by the minimax regret analysis, which reveals that in the cases $β> \frac{d}{2}$ or $p=\infty$ these rates are (essentially) optimal. Finally, we compare the performance of the kernelized ridge regression forecaster to the known non-parametric forecasters in terms of the regret rates and their computational complexity as well as to the excess risk rates in the setting of statistical (i.i.d.) nonparametric regression.

Submitted 13 July, 2021; v1 submitted 6 February, 2021; originally announced February 2021.

Comments: 40 pages, 5 figures, 3 tables (version 2)

arXiv:2102.02789 [pdf, other]

Disambiguation of weak supervision with exponential convergence rates

Authors: Vivien Cabannes, Francis Bach, Alessandro Rudi

Abstract: Machine learning approached through supervised learning requires expensive annotation of data. This motivates weakly supervised learning, where data are annotated with incomplete yet discriminative information. In this paper, we focus on partial labelling, an instance of weak supervision where, from a given input, we are given a set of potential targets. We review a disambiguation principle to recover full supervision from weak supervision, and propose an empirical disambiguation algorithm. We prove exponential convergence rates of our algorithm under classical learnability assumptions, and we illustrate the usefulness of our method on practical examples.

Submitted 15 July, 2021; v1 submitted 4 February, 2021; originally announced February 2021.

Comments: 22 pages; 6 figures

MSC Class: 68Q32 ACM Class: I.2.6; G.3; F.2.2

Journal ref: Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021

arXiv:2102.00760 [pdf, ps, other]

Fast rates in structured prediction

Authors: Vivien Cabannes, Alessandro Rudi, Francis Bach

Abstract: Discrete supervised learning problems such as classification are often tackled by introducing a continuous surrogate problem akin to regression. Bounding the original error, between estimate and solution, by the surrogate error endows discrete problems with convergence rates already shown for continuous instances. Yet, current approaches do not leverage the fact that discrete problems are essentially predicting a discrete output whereas continuous problems are predicting a continuous value. In this paper, we tackle this issue for general structured prediction problems, opening the way to "super fast" rates, that is, convergence rates for the excess risk faster than $n^{-1}$, where $n$ is the number of observations, with even exponential rates with the strongest assumptions. We first illustrate it for predictors based on nearest neighbors, generalizing rates known for binary classification to any discrete problem within the framework of structured prediction. We then consider kernel ridge regression where we improve known rates in $n^{-1/4}$ to arbitrarily fast rates, depending on a parameter characterizing the hardness of the problem, thus making it possible, under smoothness assumptions, to bypass the curse of dimensionality.

Submitted 15 July, 2021; v1 submitted 1 February, 2021; originally announced February 2021.

Comments: 14 main pages, 3 main figures, 43 pages, 4 figures (with appendix)

MSC Class: 68T05 ACM Class: I.2.6; F.2.2; G.3

Journal ref: Conference on Learning Theory, PMLR 134, 2021

arXiv:2101.05380 [pdf, other]

A Dimension-free Computational Upper-bound for Smooth Optimal Transport Estimation

Authors: Adrien Vacher, Boris Muzellec, Alessandro Rudi, Francis Bach, Francois-Xavier Vialard

Abstract: It is well-known that plug-in statistical estimation of optimal transport suffers from the curse of dimensionality. Despite recent efforts to improve the rate of estimation with the smoothness of the problem, the computational complexity of these recently proposed methods still degrades exponentially with the dimension. In this paper, thanks to an infinite-dimensional sum-of-squares representation, we derive a statistical estimator of smooth optimal transport which achieves a precision $\varepsilon$ from $\tilde{O}(\varepsilon^{-2})$ independent and identically distributed samples from the distributions, for a computational cost of $\tilde{O}(\varepsilon^{-4})$ when the smoothness increases, hence yielding dimension-free statistical and computational rates, with potentially exponentially dimension-dependent constants.

Submitted 1 October, 2021; v1 submitted 13 January, 2021; originally announced January 2021.

Comments: 30 pages

MSC Class: 62G05

arXiv:2012.11978 [pdf, ps, other]

Finding Global Minima via Kernel Approximations

Authors: Alessandro Rudi, Ulysse Marteau-Ferey, Francis Bach

Abstract: We consider the global minimization of smooth functions based solely on function evaluations. Algorithms that achieve the optimal number of function evaluations for a given precision level typically rely on explicitly constructing an approximation of the function which is then minimized with algorithms that have exponential running-time complexity. In this paper, we consider an approach that jointly models the function to approximate and finds a global minimum. This is done by using infinite sums of square smooth functions and has strong links with polynomial sum-of-squares hierarchies. Leveraging recent representation properties of reproducing kernel Hilbert spaces, the infinite-dimensional optimization problem can be solved by subsampling in time polynomial in the number of function evaluations, and with theoretical guarantees on the obtained minimum. Given $n$ samples, the computational cost is $O(n^{3.5})$ in time, $O(n^2)$ in space, and we achieve a convergence rate to the global optimum that is $O(n^{-m/d + 1/2 + 3/d})$ where $m$ is the degree of differentiability of the function and $d$ the number of dimensions. The rate is nearly optimal in the case of Sobolev functions and more generally makes the proposed method particularly suitable for functions that have a large number of derivatives. Indeed, when $m$ is in the order of $d$, the convergence rate to the global optimum does not suffer from the curse of dimensionality, which affects only the worst-case constants (that we track explicitly through the paper).

Submitted 22 December, 2020; originally announced December 2020.

arXiv:2009.04324 [pdf, other]

Overcoming the curse of dimensionality with Laplacian regularization in semi-supervised learning

Authors: Vivien Cabannes, Loucas Pillaud-Vivien, Francis Bach, Alessandro Rudi

Abstract: As annotations of data can be scarce in large-scale practical problems, leveraging unlabelled examples is one of the most important aspects of machine learning. This is the aim of semi-supervised learning. To benefit from the access to unlabelled data, it is natural to smoothly diffuse knowledge from labelled data to unlabelled data. This leads to the use of Laplacian regularization. Yet, current implementations of Laplacian regularization suffer from several drawbacks, notably the well-known curse of dimensionality. In this paper, we provide a statistical analysis to overcome those issues, and unveil a large body of spectral filtering methods that exhibit desirable behaviors. They are implemented through (reproducing) kernel methods, for which we provide realistic computational guidelines in order to make our method usable with large amounts of data.

Submitted 29 November, 2021; v1 submitted 9 September, 2020; originally announced September 2020.

Comments: 38 pages, 6 figures

Journal ref: NeurIPS 2021

arXiv:2007.14703 [pdf, other]

Learning Output Embeddings in Structured Prediction

Authors: Luc Brogat-Motte, Alessandro Rudi, Céline Brouard, Juho Rousu, Florence d'Alché-Buc

Abstract: A powerful and flexible approach to structured prediction consists in embedding the structured objects to be predicted into a feature space of possibly infinite dimension by means of output kernels, and then solving a regression problem in this output space. A prediction in the original space is computed by solving a pre-image problem. In such an approach, the embedding, linked to the target loss, is defined prior to the learning phase. In this work, we propose to jointly learn a finite approximation of the output embedding and the regression function into the new feature space. For that purpose, we leverage a priori information on the outputs and also unexploited unsupervised output data, which are both often available in structured prediction problems. We prove that the resulting structured predictor is a consistent estimator, and derive an excess risk bound. Moreover, the novel structured prediction tool enjoys a significantly smaller computational complexity than former output kernel methods. Tested empirically on various structured prediction problems, the approach proves versatile and able to handle large datasets.

Submitted 2 November, 2020; v1 submitted 29 July, 2020; originally announced July 2020.

arXiv:2007.03926 [pdf, other]

Non-parametric Models for Non-negative Functions

Authors: Ulysse Marteau-Ferey, Francis Bach, Alessandro Rudi

Abstract: Linear models have shown great effectiveness and flexibility in many fields such as machine learning, signal processing and statistics. They can represent rich spaces of functions while preserving the convexity of the optimization problems where they are used, and are simple to evaluate, differentiate and integrate. However, for modeling non-negative functions, which are crucial for unsupervised learning, density estimation, or non-parametric Bayesian methods, linear models are not applicable directly. Moreover, current state-of-the-art models like generalized linear models either lead to non-convex optimization problems, or cannot be easily integrated. In this paper we provide the first model for non-negative functions which benefits from the same good properties as linear models. In particular, we prove that it admits a representer theorem and provide an efficient dual formulation for convex problems. We study its representation power, showing that the resulting space of functions is strictly richer than that of generalized linear models. Finally we extend the model and the theoretical results to functions with outputs in convex cones. The paper is complemented by an experimental evaluation of the model showing its effectiveness in terms of formulation, algorithmic derivation and practical results on the problems of density estimation, regression with heteroscedastic errors, and multiple quantile regression.

Submitted 8 July, 2020; originally announced July 2020.

arXiv:2007.01012 [pdf, other]

Consistent Structured Prediction with Max-Min Margin Markov Networks

Authors: Alex Nowak-Vila, Francis Bach, Alessandro Rudi

Abstract: Max-margin methods for binary classification such as the support vector machine (SVM) have been extended to the structured prediction setting under the name of max-margin Markov networks ($M^3N$), or more generally structural SVMs. Unfortunately, these methods are statistically inconsistent when the relationship between inputs and labels is far from deterministic. We overcome such limitations by defining the learning problem in terms of a "max-min" margin formulation, naming the resulting method max-min margin Markov networks ($M^4N$). We prove consistency and finite sample generalization bounds for $M^4N$ and provide an explicit algorithm to compute the estimator. The algorithm achieves a generalization error of $O(1/\sqrt{n})$ for a total cost of $O(n)$ projection-oracle calls (which have at most the same cost as the max-oracle from $M^3N$). Experiments on multi-class classification, ordinal regression, sequence prediction and ranking demonstrate the effectiveness of the proposed method.

Submitted 27 July, 2020; v1 submitted 2 July, 2020; originally announced July 2020.

arXiv:2006.10350 [pdf, other]

Kernel methods through the roof: handling billions of points efficiently

Authors: Giacomo Meanti, Luigi Carratino, Lorenzo Rosasco, Alessandro Rudi

Abstract: Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large-scale problems, since naïve implementations scale poorly with data size. Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections. Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware. Towards this end, we designed a preconditioned gradient solver for kernel methods exploiting both GPU acceleration and parallelization with multiple GPUs, implementing out-of-core variants of common linear algebra operations to guarantee optimal hardware utilization. Further, we optimize the numerical precision of different operations and maximize efficiency of matrix-vector multiplications. As a result we can experimentally show dramatic speedups on datasets with billions of points, while still guaranteeing state-of-the-art performance. Additionally, we make our software available as an easy-to-use library.

Submitted 26 November, 2020; v1 submitted 18 June, 2020; originally announced June 2020.

Comments: 33 pages, 7 figures, NeurIPS 2020

arXiv:2006.09984

Interpolation and Learning with Scale Dependent Kernels

Authors: Nicolò Pagliana, Alessandro Rudi, Ernesto De Vito, Lorenzo Rosasco

Abstract: We study the learning properties of nonparametric ridge-less least squares. In particular, we consider the common case of estimators defined by scale dependent kernels, and focus on the role of the scale. These estimators interpolate the data and the scale can be shown to control their stability through the condition number. Our analysis shows that there are different regimes depending on the interplay between the sample size, its dimensions, and the smoothness of the problem. Indeed, when the sample size is less than exponential in the data dimension, then the scale can be chosen so that the learning error decreases. As the sample size becomes larger, the overall error stops decreasing but interestingly the scale can be chosen in such a way that the variance due to noise remains bounded. Our analysis combines probabilistic results with a number of analytic techniques from interpolation theory.

Submitted 10 November, 2021; v1 submitted 17 June, 2020; originally announced June 2020.

Comments: The paper is not completed and contains parts which need to be modified

arXiv:2006.09261 [pdf, other]

Structured and Localized Image Restoration

Authors: Thomas Eboli, Alex Nowak-Vila, Jian Sun, Francis Bach, Jean Ponce, Alessandro Rudi

Abstract: We present a novel approach to image restoration that leverages ideas from localized structured prediction and non-linear multi-task learning. We optimize a penalized energy function regularized by a sum of terms measuring the distance between patches to be restored and clean patches from an external database gathered beforehand. The resulting estimator comes with strong statistical guarantees leveraging local dependency properties of overlapping patches. We derive the corresponding algorithms for energies based on the mean-squared and Euclidean norm errors. Finally, we demonstrate the practical effectiveness of our model on different image restoration problems using standard benchmarks.

Submitted 16 June, 2020; originally announced June 2020.

arXiv:2003.08109 [pdf, other]

Efficient improper learning for online logistic regression

Authors: Rémi Jézéquel, Pierre Gaillard, Alessandro Rudi

Abstract: We consider the setting of online logistic regression and study the regret with respect to the 2-ball of radius B. It is known (see [Hazan et al., 2014]) that any proper algorithm which has logarithmic regret in the number of samples (denoted n) necessarily suffers an exponential multiplicative constant in B. In this work, we design an efficient improper algorithm that avoids this exponential constant while preserving a logarithmic regret. Indeed, [Foster et al., 2018] showed that the lower bound does not apply to improper algorithms and proposed a strategy based on exponential weights with prohibitive computational complexity. Our new algorithm based on regularized empirical risk minimization with surrogate losses satisfies a regret scaling as O(B log(Bn)) with a per-round time-complexity of order O(d^2).

Submitted 3 November, 2020; v1 submitted 18 March, 2020; originally announced March 2020.

Journal ref: Conference on Learning Theory 2020, Jul 2020, Graz, Austria

arXiv:2003.00920 [pdf, other]

Structured Prediction with Partial Labelling through the Infimum Loss

Authors: Vivien Cabannes, Alessandro Rudi, Francis Bach

Abstract: Annotating datasets is one of the main costs in supervised learning today. The goal of weak supervision is to enable models to learn using only forms of labelling which are cheaper to collect, such as partial labelling. This is a type of incomplete annotation where, for each datapoint, supervision is cast as a set of labels containing the real one. The problem of supervised learning with partial labelling has been studied for specific instances such as classification, multi-label, ranking or segmentation, but a general framework is still missing. This paper provides a unified framework based on structured prediction and on the concept of infimum loss to deal with partial labelling over a wide family of learning problems and loss functions. The framework leads naturally to explicit algorithms that can be easily implemented and for which we prove statistical consistency and learning rates. Experiments confirm the superiority of the proposed approach over commonly used baselines.

Submitted 9 September, 2020; v1 submitted 2 March, 2020; originally announced March 2020.

Comments: 8 pages for main paper, 27 with main paper, 13 figures, 3 tables

MSC Class: 68Q32 ACM Class: I.2.6; G.3

Journal ref: Proceedings of the 37th International Conference on Machine Learning, PMLR 119:1230-1239, 2020
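
The infimum loss named in the abstract above can be sketched in a few lines, under the assumption that it scores a prediction against a set of candidate labels by the smallest base loss over that set. The function names and the two base losses below are hypothetical choices for illustration, not the paper's code.

```python
def infimum_loss(prediction, candidate_set, base_loss):
    """Smallest base loss between the prediction and any candidate label.

    candidate_set plays the role of the partial label: a set of labels
    assumed to contain the true one.
    """
    if not candidate_set:
        raise ValueError("candidate_set must contain at least one label")
    return min(base_loss(prediction, label) for label in candidate_set)


def zero_one_loss(prediction, label):
    """Base loss for classification: 0 if the labels match, 1 otherwise."""
    return 0.0 if prediction == label else 1.0


def absolute_loss(prediction, label):
    """Base loss for real-valued outputs."""
    return abs(prediction - label)


if __name__ == "__main__":
    # A prediction inside the candidate set incurs no loss.
    print(infimum_loss("cat", {"cat", "dog"}, zero_one_loss))
    # A prediction outside the set is penalized by its closest candidate.
    print(infimum_loss(2.0, {4.0, 7.0}, absolute_loss))
```

When the candidate set is a singleton, this reduces to the ordinary fully supervised base loss.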

arXiv:2002.05424 [pdf, ps, other]

A General Framework for Consistent Structured Prediction with Implicit Loss Embeddings

Authors: Carlo Ciliberto, Lorenzo Rosasco, Alessandro Rudi

Abstract: We propose and analyze a novel theoretical and algorithmic framework for structured prediction. While so far the term has referred to discrete output spaces, here we consider more general settings, such as manifolds or spaces of probability measures. We define structured prediction as a problem where the output space lacks a vectorial structure. We identify and study a large class of loss functions that implicitly defines a suitable geometry on the problem. The latter is the key to developing an algorithmic framework amenable to a sharp statistical analysis and yielding efficient computations. When dealing with output spaces with infinite cardinality, a suitable implicit formulation of the estimator is shown to be crucial.

Submitted 13 February, 2020; originally announced February 2020.

Comments: 53 pages

arXiv:2001.10477 [pdf, ps, other]

doi 10.1103/PhysRevA.102.042414

Statistical Limits of Supervised Quantum Learning

Authors: Carlo Ciliberto, Andrea Rocchetto, Alessandro Rudi, Leonard Wossnig

Abstract: Within the framework of statistical learning theory it is possible to bound the minimum number of samples required by a learner to reach a target accuracy. We show that if the bound on the accuracy is taken into account, quantum machine learning algorithms for supervised learning---for which statistical guarantees are available---cannot achieve polylogarithmic runtimes in the input dimension. We conclude that, when no further assumptions on the problem are made, quantum machine learning algorithms for supervised learning can have at most polynomial speedups over efficient classical algorithms, even in cases where quantum access to the data is naturally available.

Submitted 29 October, 2020; v1 submitted 28 January, 2020; originally announced January 2020.

Comments: v3: 6 pages, journal version, title changed (previous title "The Statistical Limits of Supervised Quantum Learning"), other minor improvements; v2: 6 pages, title changed (previous title "Fast quantum learning with statistical guarantees"), format changed to two-columns, typos corrected, remarks that better clarify the limitations of our analysis added

Journal ref: Phys. Rev. A 102, 042414 (2020)

arXiv:1910.14564 [pdf, other]

Statistical Estimation of the Poincaré constant and Application to Sampling Multimodal Distributions

Authors: Loucas Pillaud-Vivien, Francis Bach, Tony Lelièvre, Alessandro Rudi, Gabriel Stoltz

Abstract: Poincaré inequalities are ubiquitous in probability and analysis and have various applications in statistics (concentration of measure, rate of convergence of Markov chains). The Poincaré constant, for which the inequality is tight, is related to the typical convergence rate of diffusions to their equilibrium measure. In this paper, we show both theoretically and experimentally that, given sufficiently many samples of a measure, we can estimate its Poincaré constant. As a by-product of the estimation of the Poincaré constant, we derive an algorithm that captures a low dimensional representation of the data by finding directions which are difficult to sample. These directions are of crucial importance for sampling or in fields like molecular dynamics, where they are called reaction coordinates. Their knowledge can leverage, with a simple conditioning step, computational bottlenecks by using importance sampling techniques.

Submitted 22 November, 2019; v1 submitted 28 October, 2019; originally announced October 2019.

arXiv:1907.12758 [pdf, ps, other]

Packing Rotating Segments

Authors: Ali Gholami Rudi

Abstract: We show that the following variant of labeling rotating maps is NP-hard, and present a polynomial approximation scheme for solving it. The input is a set of feature points on a map, to each of which a vertical bar of zero width is assigned. The goal is to choose the largest subsets of the bars such that when the map is rotated and the labels remain vertical, none of the bars intersect. We extend this algorithm to the general case where labels are arbitrary objects.

Submitted 30 July, 2019; originally announced July 2019.

MSC Class: 68U05

arXiv:1907.05226 [pdf, other]

Gain with no Pain: Efficient Kernel-PCA by Nyström Sampling

Authors: Nicholas Sterge, Bharath Sriperumbudur, Lorenzo Rosasco, Alessandro Rudi

Abstract: In this paper, we propose and study a Nyström based approach to efficient large scale kernel principal component analysis (PCA). The latter is a natural nonlinear extension of classical PCA based on considering a nonlinear feature map or the corresponding kernel. Like other kernel approaches, kernel PCA enjoys good mathematical and statistical properties but, numerically, it scales poorly with the sample size. Our analysis shows that Nyström sampling greatly improves computational efficiency without incurring any loss of statistical accuracy. While similar effects have been observed in supervised learning, this is the first such result for PCA. Our theoretical findings, which are also illustrated by numerical results, are based on a combination of analytic and concentration of measure techniques. Our study is more broadly motivated by the question of understanding the interplay between statistical and computational requirements for learning.

Submitted 11 July, 2019; originally announced July 2019.

Comments: 19 pages, 2 figures

MSC Class: 62H25; 62H12; 46E22

arXiv:1907.01771 [pdf, other]

Globally Convergent Newton Methods for Ill-conditioned Generalized Self-concordant Losses

Authors: Ulysse Marteau-Ferey, Francis Bach, Alessandro Rudi

Abstract: In this paper, we study large-scale convex optimization algorithms based on the Newton method applied to regularized generalized self-concordant losses, which include logistic regression and softmax regression. We first prove that our new simple scheme based on a sequence of problems with decreasing regularization parameters is provably globally convergent, and that this convergence is linear with a constant factor which scales only logarithmically with the condition number. In the parametric setting, we obtain an algorithm with the same scaling as regular first-order methods but with an improved behavior, in particular in ill-conditioned problems. Second, in the non-parametric machine learning setting, we provide an explicit algorithm combining the previous scheme with Nyström projection techniques, and prove that it achieves optimal generalization bounds with a time complexity of order $O(n\,\mathrm{df}_λ)$, a memory complexity of order $O(\mathrm{df}_λ^2)$ and no dependence on the condition number, generalizing the results known for least-squares regression. Here $n$ is the number of observations and $\mathrm{df}_λ$ is the associated degrees of freedom. In particular, this is the first large-scale algorithm to solve logistic and softmax regressions in the non-parametric setting with large condition numbers and theoretical guarantees.

Submitted 21 November, 2019; v1 submitted 3 July, 2019; originally announced July 2019.

Journal ref: NeurIPS 2019 - Conference on Neural Information Processing Systems, Dec 2019, Vancouver, Canada

arXiv:1902.09917 [pdf, other]

Efficient online learning with kernels for adversarial large scale problems

Authors: Rémi Jézéquel, Pierre Gaillard, Alessandro Rudi

Abstract: We are interested in a framework of online learning with kernels for low-dimensional but large-scale and potentially adversarial datasets. We study the computational and theoretical performance of online variations of kernel Ridge regression. Despite its simplicity, the algorithm we study is the first to achieve the optimal regret for a wide range of kernels with a per-round complexity of order $n^α$ with $α < 2$. The algorithm we consider is based on approximating the kernel with the linear span of basis functions. Our contribution is two-fold: 1) For the Gaussian kernel, we propose to build the basis beforehand (independently of the data) through Taylor expansion. For $d$-dimensional inputs, we provide a (close to) optimal regret of order $O((\log n)^{d+1})$ with per-round time complexity and space complexity $O((\log n)^{2d})$. This makes the algorithm a suitable choice as soon as $n \gg e^d$, which is likely to happen in a scenario with a low-dimensional and large-scale dataset; 2) For general kernels with low effective dimension, the basis functions are updated sequentially in a data-adaptive fashion by sampling Nyström points. In this case, our algorithm improves the computational trade-off known for online kernel regression.

Submitted 29 May, 2019; v1 submitted 26 February, 2019; originally announced February 2019.

arXiv:1902.03086 [pdf, ps, other]

Affine Invariant Covariance Estimation for Heavy-Tailed Distributions

Authors: Dmitrii Ostrovskii, Alessandro Rudi

Abstract: In this work we provide an estimator for the covariance matrix of a heavy-tailed multivariate distribution. We prove that the proposed estimator $\widehat{\mathbf{S}}$ admits an affine-invariant bound of the form \[(1-\varepsilon) \mathbf{S} \preccurlyeq \widehat{\mathbf{S}} \preccurlyeq (1+\varepsilon) \mathbf{S}\] in high probability, where $\mathbf{S}$ is the unknown covariance matrix, and $\preccurlyeq$ is the positive semidefinite order on symmetric matrices. The result only requires the existence of fourth-order moments, and allows for $\varepsilon = O(\sqrt{κ^4 d\log(d/δ)/n})$ where $κ^4$ is a measure of kurtosis of the distribution, $d$ is the dimensionality of the space, $n$ is the sample size, and $1-δ$ is the desired confidence level. More generally, we can allow for regularization with level $λ$, in which case $d$ gets replaced with the degrees of freedom number. Denoting by $\text{cond}(\mathbf{S})$ the condition number of $\mathbf{S}$, the computational cost of the novel estimator is $O(d^2 n + d^3\log(\text{cond}(\mathbf{S})))$, which is comparable to the cost of the sample covariance estimator in the statistically interesting regime $n \ge d$. We consider applications of our estimator to eigenvalue estimation with relative error, and to ridge regression with heavy-tailed random design.

Submitted 24 September, 2019; v1 submitted 8 February, 2019; originally announced February 2019.

Journal ref: 32nd Annual Conference on Learning Theory (COLT), 2019, Jun 2019, Phoenix, United States

arXiv:1902.03046 [pdf, ps, other]

Beyond Least-Squares: Fast Rates for Regularized Empirical Risk Minimization through Self-Concordance

Authors: Ulysse Marteau-Ferey, Dmitrii Ostrovskii, Francis Bach, Alessandro Rudi

Abstract: We consider learning methods based on the regularization of a convex empirical risk by a squared Hilbertian norm, a setting that includes linear predictors and non-linear predictors through positive-definite kernels. In order to go beyond the generic analysis leading to convergence rates of the excess risk as $O(1/\sqrt{n})$ from $n$ observations, we assume that the individual losses are self-concordant, that is, their third-order derivatives are bounded by their second-order derivatives. This setting includes least-squares, as well as all generalized linear models such as logistic and softmax regression. For this class of losses, we provide a bias-variance decomposition and show that the assumptions commonly made in least-squares regression, such as the source and capacity conditions, can be adapted to obtain fast non-asymptotic rates of convergence by improving the bias terms, the variance terms or both.

Submitted 18 June, 2019; v1 submitted 8 February, 2019; originally announced February 2019.

arXiv:1902.01958 [pdf, other]

A General Theory for Structured Prediction with Smooth Convex Surrogates

Authors: Alex Nowak-Vila, Francis Bach, Alessandro Rudi

Abstract: In this work we provide a theoretical framework for structured prediction that generalizes the existing theory of surrogate methods for binary and multiclass classification based on estimating conditional probabilities with smooth convex surrogates (e.g. logistic regression). The theory relies on a natural characterization of structural properties of the task loss and allows us to derive statistical guarantees for many widely used methods in the context of multilabeling, ranking, ordinal regression and graph matching. In particular, we characterize the smooth convex surrogates compatible with a given task loss in terms of a suitable Bregman divergence composed with a link function. This allows us to derive tight bounds for the calibration function and to obtain novel results on existing surrogate frameworks for structured prediction such as conditional random fields and quadratic surrogates.

Submitted 13 February, 2019; v1 submitted 5 February, 2019; originally announced February 2019.

arXiv:1901.01763 [pdf, ps, other]

Approximate Discontinuous Trajectory Hotspots

Authors: Ali Gholami Rudi

Abstract: A hotspot is an axis-aligned square of fixed side length $s$ in which the duration of the presence of an entity moving in the plane is maximised. An exact hotspot of a polygonal trajectory with $n$ edges can be found in $O(n^2)$ time. Defining a $c$-approximate hotspot as an axis-aligned square of side length $cs$ in which the duration of the entity's presence is no less than that of an exact hotspot, in this paper we present an algorithm to find a $(1 + ε)$-approximate hotspot of a polygonal trajectory with time complexity $O(\frac{nφ}{ε} \log \frac{nφ}{ε})$, where $φ$ is the ratio of average trajectory edge length to $s$.

Submitted 7 January, 2019; originally announced January 2019.

MSC Class: 68U05 ACM Class: F.2.2

Showing 1–50 of 75 results for author: Rudi, A