-
A PAC-Bayesian Link Between Generalisation and Flat Minima
Authors:
Maxime Haddouche,
Paul Viallard,
Umut Şimşekli,
Benjamin Guedj
Abstract:
Modern machine learning usually involves predictors in the overparametrised setting (number of trained parameters greater than dataset size), and their training yields not only good performance on training data, but also good generalisation capacity. This phenomenon challenges many theoretical results, and remains an open problem. To reach a better understanding, we provide novel generalisation bounds involving gradient terms. To do so, we combine the PAC-Bayes toolbox with Poincaré and Log-Sobolev inequalities, avoiding an explicit dependency on the dimension of the predictor space. Our results highlight the positive influence of \emph{flat minima} (minima with a neighbourhood that also nearly minimises the learning problem) on generalisation performance, directly involving the benefits of the optimisation phase.
Submitted 13 February, 2024;
originally announced February 2024.
-
Tighter Generalisation Bounds via Interpolation
Authors:
Paul Viallard,
Maxime Haddouche,
Umut Şimşekli,
Benjamin Guedj
Abstract:
This paper contains a recipe for deriving new PAC-Bayes generalisation bounds based on the $(f, Γ)$-divergence, and, in addition, presents PAC-Bayes generalisation bounds where we interpolate between a series of probability divergences (including but not limited to KL, Wasserstein, and total variation), making the best out of many worlds depending on the properties of the posterior distribution. We explore the tightness of these bounds and connect them to earlier results from statistical learning, which are special cases. We also instantiate our bounds as training objectives, yielding non-trivial guarantees and practical performance.
Submitted 7 February, 2024;
originally announced February 2024.
-
Federated Learning with Nonvacuous Generalisation Bounds
Authors:
Pierre Jobic,
Maxime Haddouche,
Benjamin Guedj
Abstract:
We introduce a novel strategy to train randomised predictors in federated learning, where each node of the network aims at preserving its privacy by releasing a local predictor but keeping its training dataset secret from the other nodes. We then build a global randomised predictor which inherits the properties of the local private predictors in the sense of a PAC-Bayesian generalisation bound. We consider the synchronous case where all nodes share the same training objective (derived from a generalisation bound), and the asynchronous case where each node may have its own personalised training objective. We show through a series of numerical experiments that our approach achieves a predictive performance comparable to that of the batch approach where all datasets are shared across nodes. Moreover, the predictors are supported by numerically nonvacuous generalisation bounds while preserving privacy for each node. We explicitly compute the increment on predictive performance and generalisation bounds between batch and federated settings, highlighting the price to pay to preserve privacy.
Submitted 17 October, 2023;
originally announced October 2023.
-
Learning via Wasserstein-Based High Probability Generalisation Bounds
Authors:
Paul Viallard,
Maxime Haddouche,
Umut Şimşekli,
Benjamin Guedj
Abstract:
Minimising upper bounds on the population risk or the generalisation gap has been widely used in structural risk minimisation (SRM) -- this is in particular at the core of PAC-Bayesian learning. Despite its successes and unfailing surge of interest in recent years, a limitation of the PAC-Bayesian framework is that most bounds involve a Kullback-Leibler (KL) divergence term (or its variations), which might exhibit erratic behavior and fail to capture the underlying geometric structure of the learning problem -- hence restricting its use in practical applications. As a remedy, recent studies have attempted to replace the KL divergence in the PAC-Bayesian bounds with the Wasserstein distance. Even though these bounds alleviated the aforementioned issues to a certain extent, they either hold in expectation, are for bounded losses, or are nontrivial to minimize in an SRM framework. In this work, we contribute to this line of research and prove novel Wasserstein distance-based PAC-Bayesian generalisation bounds for both batch learning with independent and identically distributed (i.i.d.) data, and online learning with potentially non-i.i.d. data. Contrary to previous art, our bounds are stronger in the sense that (i) they hold with high probability, (ii) they apply to unbounded (potentially heavy-tailed) losses, and (iii) they lead to optimizable training objectives that can be used in SRM. As a result we derive novel Wasserstein-based PAC-Bayesian learning algorithms and we illustrate their empirical advantage on a variety of experiments.
Submitted 27 October, 2023; v1 submitted 7 June, 2023;
originally announced June 2023.
-
Wasserstein PAC-Bayes Learning: Exploiting Optimisation Guarantees to Explain Generalisation
Authors:
Maxime Haddouche,
Benjamin Guedj
Abstract:
PAC-Bayes learning is an established framework to both assess the generalisation ability of learning algorithms, and design new learning algorithms by exploiting generalisation bounds as training objectives. Most of the existing bounds involve a \emph{Kullback-Leibler} (KL) divergence, which fails to capture the geometric properties of the loss function that are often useful in optimisation. We address this by extending the emerging \emph{Wasserstein PAC-Bayes} theory. We develop new PAC-Bayes bounds with Wasserstein distances replacing the usual KL, and demonstrate that sound optimisation guarantees translate to good generalisation abilities. In particular, we provide generalisation bounds for the \emph{Bures-Wasserstein SGD} by exploiting its optimisation properties.
Submitted 30 May, 2023; v1 submitted 14 April, 2023;
originally announced April 2023.
-
Optimistically Tempered Online Learning
Authors:
Maxime Haddouche,
Olivier Wintenberger,
Benjamin Guedj
Abstract:
Optimistic Online Learning algorithms have been developed to exploit expert advice, optimistically assumed to be always useful. However, it is legitimate to question the relevance of such advice \emph{w.r.t.} the learning information provided by gradient-based online algorithms. In this work, we challenge the confidence assumption on the expert and develop the \emph{optimistically tempered} (OT) online learning framework as well as OT adaptations of online algorithms. Our algorithms come with sound theoretical guarantees in the form of dynamic regret bounds, and we provide experimental validation of the usefulness of the OT approach.
Submitted 14 February, 2024; v1 submitted 18 January, 2023;
originally announced January 2023.
-
PAC-Bayes Generalisation Bounds for Heavy-Tailed Losses through Supermartingales
Authors:
Maxime Haddouche,
Benjamin Guedj
Abstract:
While PAC-Bayes is now an established learning framework for light-tailed losses (\emph{e.g.}, subgaussian or subexponential), its extension to the case of heavy-tailed losses remains largely uncharted and has attracted growing interest in recent years. We contribute PAC-Bayes generalisation bounds for heavy-tailed losses under the sole assumption of bounded variance of the loss function. Under that assumption, we extend previous results from \citet{kuzborskij2019efron}. Our key technical contribution is exploiting an extension of Markov's inequality for supermartingales. Our proof technique unifies and extends different PAC-Bayesian frameworks by providing bounds for unbounded martingales as well as bounds for batch and online learning with heavy-tailed losses.
Submitted 24 April, 2023; v1 submitted 3 October, 2022;
originally announced October 2022.
-
Online PAC-Bayes Learning
Authors:
Maxime Haddouche,
Benjamin Guedj
Abstract:
Most PAC-Bayesian bounds hold in the batch learning setting where data is collected at once, prior to inference or prediction. This somewhat departs from many contemporary learning problems where data streams are collected and the algorithms must dynamically adjust. We prove new PAC-Bayesian bounds in this online learning framework, leveraging an updated definition of regret, and we revisit classical PAC-Bayesian results with a batch-to-online conversion, extending their remit to the case of dependent data. Our results hold for bounded losses, potentially \emph{non-convex}, paving the way to promising developments in online learning.
Submitted 13 October, 2022; v1 submitted 31 May, 2022;
originally announced June 2022.
-
A unified approach for covariance matrix estimation under Stein loss
Authors:
Anis M. Haddouche,
Wei Lu
Abstract:
In this paper, we address the problem of estimating the covariance matrix of a multivariate Gaussian distribution, relative to a Stein loss function, from a decision-theoretic point of view. We investigate the case where the covariance matrix is invertible and the case where it is non-invertible in a unified approach.
Submitted 20 March, 2021;
originally announced March 2021.
-
Covariance matrix estimation under data-based loss
Authors:
Anis M. Haddouche,
Dominique Fourdrinier,
Fatiha Mezoued
Abstract:
In this paper, we consider the problem of estimating the $p\times p$ scale matrix $Σ$ of a multivariate linear regression model $Y = Xβ + \mathcal{E}$ when the distribution of the observed matrix $Y$ belongs to a large class of elliptically symmetric distributions. After deriving the canonical form $(Z^T U^T)^T$ of this model, any estimator $\hat{Σ}$ of $Σ$ is assessed through the data-based loss $\mathrm{tr}(S^{+}Σ\,(Σ^{-1}\hat{Σ} - I_p)^2)$, where $S=U^T U$ is the sample covariance matrix and $S^{+}$ is its Moore-Penrose inverse. We provide alternative estimators to the usual estimators $a\,S$, where $a$ is a positive constant, which have smaller associated risk. Compared to the usual quadratic loss $\mathrm{tr}(Σ^{-1}\hat{Σ} - I_p)^2$, we obtain a larger class of estimators and a wider class of elliptical distributions for which such an improvement occurs. A numerical study illustrates the theory.
Submitted 22 December, 2020;
originally announced December 2020.
-
Upper and Lower Bounds on the Performance of Kernel PCA
Authors:
Maxime Haddouche,
Benjamin Guedj,
John Shawe-Taylor
Abstract:
Principal Component Analysis (PCA) is a popular method for dimension reduction and has attracted unfailing interest for decades. More recently, kernel PCA (KPCA) has emerged as an extension of PCA but, despite its use in practice, a sound theoretical understanding of KPCA is missing. We contribute several lower and upper bounds on the efficiency of KPCA, involving the empirical eigenvalues of the kernel Gram matrix and new quantities involving a notion of variance. These bounds show how much information is captured by KPCA on average and contribute to a better theoretical understanding of its efficiency. We demonstrate that fast convergence rates are achievable for a widely used class of kernels and we highlight the importance of some desirable properties of datasets to ensure KPCA efficiency.
Submitted 23 January, 2023; v1 submitted 18 December, 2020;
originally announced December 2020.
-
PAC-Bayes unleashed: generalisation bounds with unbounded losses
Authors:
Maxime Haddouche,
Benjamin Guedj,
Omar Rivasplata,
John Shawe-Taylor
Abstract:
We present new PAC-Bayesian generalisation bounds for learning problems with unbounded loss functions. This extends the relevance and applicability of the PAC-Bayes learning framework, where most of the existing literature focuses on supervised learning problems with a bounded loss function (typically assumed to take values in the interval [0, 1]). In order to relax this assumption, we propose a new notion called HYPE (standing for \emph{HYPothesis-dependent rangE}), which effectively allows the range of the loss to depend on each predictor. Based on this new notion, we derive a novel PAC-Bayesian generalisation bound for unbounded loss functions, and we instantiate it on a linear regression problem. To make our theory usable by the largest audience possible, we include discussions on actual computation, practicality and limitations of our assumptions.
Submitted 30 September, 2020; v1 submitted 12 June, 2020;
originally announced June 2020.
-
Scale matrix estimation under data-based loss in high and low dimensions
Authors:
Mohamed Anis Haddouche,
Dominique Fourdrinier,
Fatiha Mezoued
Abstract:
We consider the problem of estimating the scale matrix $Σ$ of the additive model $Y_{p\times n} = M + \mathcal{E}$ from a decision-theoretic point of view. Here, $p$ is the number of variables, $n$ is the number of observations, $M$ is a matrix of unknown parameters with rank $q<p$ and $\mathcal{E}$ is a random noise whose distribution is elliptically symmetric with covariance matrix proportional to $I_n \otimes Σ$. We deal with a canonical form of this model where $Y$ is decomposed into two matrices, namely $Z_{q\times p}$, which summarizes the information contained in $M$, and $U_{m\times p}$, where $m=n-q$, which summarizes the sufficient information to estimate $Σ$. As the natural estimators of the form $\hat{Σ}_a = a\,S$ (where $S=U^{T}U$ and $a$ is a positive constant) perform poorly when $p > m$ ($S$ non-invertible), we propose estimators of the form $\hat{Σ}_{a,G} = a\big(S + S\,S^{+}\,G(Z,S)\big)$, where $S^{+}$ is the Moore-Penrose inverse of $S$ (which coincides with $S^{-1}$ when $S$ is invertible). We provide conditions on the correction matrix $SS^{+}G(Z,S)$ such that $\hat{Σ}_{a,G}$ improves over $\hat{Σ}_a$ under the data-based loss $L_S(Σ, \hat{Σ}) = \mathrm{tr}\big(S^{+}Σ\,(\hat{Σ}\,Σ^{-1} - I_p)^{2}\big)$. We adopt a unified approach to the two cases where $S$ is invertible ($p \leq m$) and $S$ is non-invertible ($p>m$).
Submitted 30 May, 2020;
originally announced June 2020.