A PAC-Bayesian Link Between Generalisation and Flat Minima

Authors: Maxime Haddouche, Paul Viallard, Umut Simsekli, Benjamin Guedj

Abstract: Modern machine learning usually involves predictors in the overparametrised setting (number of trained parameters greater than dataset size), and their training yields not only good performances on training data, but also good generalisation capacity. This phenomenon challenges many theoretical results, and remains an open problem. To reach a better understanding, we provide novel generalisation bounds involving gradient terms. To do so, we combine the PAC-Bayes toolbox with Poincaré and Log-Sobolev inequalities, avoiding an explicit dependency on the dimension of the predictor space. Our results highlight the positive influence of \emph{flat minima} (being minima with a neighbourhood nearly minimising the learning problem as well) on generalisation performances, involving directly the benefits of the optimisation phase.

Submitted 13 February, 2024; originally announced February 2024.

Comments: We provide novel PAC-Bayesian generalisation bounds involving gradient norms and being interpretable under the lens of flat minima

arXiv:2402.05101

Tighter Generalisation Bounds via Interpolation

Authors: Paul Viallard, Maxime Haddouche, Umut Şimşekli, Benjamin Guedj

Abstract: This paper contains a recipe for deriving new PAC-Bayes generalisation bounds based on the $(f, Γ)$-divergence, and, in addition, presents PAC-Bayes generalisation bounds where we interpolate between a series of probability divergences (including but not limited to KL, Wasserstein, and total variation), making the best out of many worlds depending on the posterior distributions properties. We explore the tightness of these bounds and connect them to earlier results from statistical learning, which are specific cases. We also instantiate our bounds as training objectives, yielding non-trivial guarantees and practical performances.

Submitted 7 February, 2024; originally announced February 2024.

arXiv:2310.11203

Federated Learning with Nonvacuous Generalisation Bounds

Authors: Pierre Jobic, Maxime Haddouche, Benjamin Guedj

Abstract: We introduce a novel strategy to train randomised predictors in federated learning, where each node of the network aims at preserving its privacy by releasing a local predictor but keeping secret its training dataset with respect to the other nodes. We then build a global randomised predictor which inherits the properties of the local private predictors in the sense of a PAC-Bayesian generalisation bound. We consider the synchronous case where all nodes share the same training objective (derived from a generalisation bound), and the asynchronous case where each node may have its own personalised training objective. We show through a series of numerical experiments that our approach achieves a comparable predictive performance to that of the batch approach where all datasets are shared across nodes. Moreover, the predictors are supported by numerically nonvacuous generalisation bounds while preserving privacy for each node. We explicitly compute the increment on predictive performance and generalisation bounds between batch and federated settings, highlighting the price to pay to preserve privacy.

Submitted 17 October, 2023; originally announced October 2023.

arXiv:2306.04375

Learning via Wasserstein-Based High Probability Generalisation Bounds

Authors: Paul Viallard, Maxime Haddouche, Umut Şimşekli, Benjamin Guedj

Abstract: Minimising upper bounds on the population risk or the generalisation gap has been widely used in structural risk minimisation (SRM) -- this is in particular at the core of PAC-Bayesian learning. Despite its successes and unfailing surge of interest in recent years, a limitation of the PAC-Bayesian framework is that most bounds involve a Kullback-Leibler (KL) divergence term (or its variations), which might exhibit erratic behavior and fail to capture the underlying geometric structure of the learning problem -- hence restricting its use in practical applications. As a remedy, recent studies have attempted to replace the KL divergence in the PAC-Bayesian bounds with the Wasserstein distance. Even though these bounds alleviated the aforementioned issues to a certain extent, they either hold in expectation, are for bounded losses, or are nontrivial to minimize in an SRM framework. In this work, we contribute to this line of research and prove novel Wasserstein distance-based PAC-Bayesian generalisation bounds for both batch learning with independent and identically distributed (i.i.d.) data, and online learning with potentially non-i.i.d. data. Contrary to previous art, our bounds are stronger in the sense that (i) they hold with high probability, (ii) they apply to unbounded (potentially heavy-tailed) losses, and (iii) they lead to optimizable training objectives that can be used in SRM. As a result we derive novel Wasserstein-based PAC-Bayesian learning algorithms and we illustrate their empirical advantage on a variety of experiments.

Submitted 27 October, 2023; v1 submitted 7 June, 2023; originally announced June 2023.

Comments: Accepted to NeurIPS 2023

arXiv:2304.07048

Wasserstein PAC-Bayes Learning: Exploiting Optimisation Guarantees to Explain Generalisation

Authors: Maxime Haddouche, Benjamin Guedj

Abstract: PAC-Bayes learning is an established framework to both assess the generalisation ability of learning algorithms, and design new learning algorithms by exploiting generalisation bounds as training objectives. Most of the existing bounds involve a \emph{Kullback-Leibler} (KL) divergence, which fails to capture the geometric properties of the loss function which are often useful in optimisation. We address this by extending the emerging \emph{Wasserstein PAC-Bayes} theory. We develop new PAC-Bayes bounds with Wasserstein distances replacing the usual KL, and demonstrate that sound optimisation guarantees translate to good generalisation abilities. In particular we provide generalisation bounds for the \emph{Bures-Wasserstein SGD} by exploiting its optimisation properties.

Submitted 30 May, 2023; v1 submitted 14 April, 2023; originally announced April 2023.

arXiv:2301.07530

Optimistically Tempered Online Learning

Authors: Maxime Haddouche, Olivier Wintenberger, Benjamin Guedj

Abstract: Optimistic Online Learning algorithms have been developed to exploit expert advice, assumed optimistically to be always useful. However, it is legitimate to question the relevance of such advice \emph{w.r.t.} the learning information provided by gradient-based online algorithms. In this work, we challenge the confidence assumption on the expert and develop the \emph{optimistically tempered} (OT) online learning framework as well as OT adaptations of online algorithms. Our algorithms come with sound theoretical guarantees in the form of dynamic regret bounds, and we eventually provide experimental validation of the usefulness of the OT approach.

Submitted 14 February, 2024; v1 submitted 18 January, 2023; originally announced January 2023.

arXiv:2210.00928

PAC-Bayes Generalisation Bounds for Heavy-Tailed Losses through Supermartingales

Authors: Maxime Haddouche, Benjamin Guedj

Abstract: While PAC-Bayes is now an established learning framework for light-tailed losses (\emph{e.g.}, subgaussian or subexponential), its extension to the case of heavy-tailed losses remains largely uncharted and has attracted a growing interest in recent years. We contribute PAC-Bayes generalisation bounds for heavy-tailed losses under the sole assumption of bounded variance of the loss function. Under that assumption, we extend previous results from \citet{kuzborskij2019efron}. Our key technical contribution is exploiting an extension of Markov's inequality for supermartingales. Our proof technique unifies and extends different PAC-Bayesian frameworks by providing bounds for unbounded martingales as well as bounds for batch and online learning with heavy-tailed losses.

Submitted 24 April, 2023; v1 submitted 3 October, 2022; originally announced October 2022.

Comments: New Section 3 on Online PAC-Bayes

arXiv:2206.00024

Online PAC-Bayes Learning

Authors: Maxime Haddouche, Benjamin Guedj

Abstract: Most PAC-Bayesian bounds hold in the batch learning setting where data is collected at once, prior to inference or prediction. This somewhat departs from many contemporary learning problems where data streams are collected and the algorithms must dynamically adjust. We prove new PAC-Bayesian bounds in this online learning framework, leveraging an updated definition of regret, and we revisit classical PAC-Bayesian results with a batch-to-online conversion, extending their remit to the case of dependent data. Our results hold for bounded losses, potentially \emph{non-convex}, paving the way to promising developments in online learning.

Submitted 13 October, 2022; v1 submitted 31 May, 2022; originally announced June 2022.

Comments: 21 pages

Journal ref: 36th Conference on Neural Information Processing Systems (NeurIPS 2022)

arXiv:2103.11147

A unified approach for covariance matrix estimation under Stein loss

Authors: Anis M. Haddouche, Wei Lu

Abstract: In this paper, we address the problem of estimating a covariance matrix of a multivariate Gaussian distribution, relative to a Stein loss function, from a decision theoretic point of view. We investigate the case where the covariance matrix is invertible and the case when it is non-invertible in a unified approach.

Submitted 20 March, 2021; originally announced March 2021.

arXiv:2012.11920

Covariance matrix estimation under data-based loss

Authors: Anis M. Haddouche, Dominique Fourdrinier, Fatiha Mezoued

Abstract: In this paper, we consider the problem of estimating the $p\times p$ scale matrix $Σ$ of a multivariate linear regression model $Y=X\,β+ \mathcal{E}\,$ when the distribution of the observed matrix $Y$ belongs to a large class of elliptically symmetric distributions. After deriving the canonical form $(Z^T U^T)^T$ of this model, any estimator $\hat{ Σ}$ of $Σ$ is assessed through the data-based loss tr$(S^{+}Σ\, (Σ^{-1}\hatΣ - I_p)^2 )\,$ where $S=U^T U$ is the sample covariance matrix and $S^{+}$ is its Moore-Penrose inverse. We provide alternative estimators to the usual estimators $a\,S$, where $a$ is a positive constant, which present smaller associated risk. Compared to the usual quadratic loss tr$(Σ^{-1}\hatΣ - I_p)^2$, we obtain a larger class of estimators and a wider class of elliptical distributions for which such an improvement occurs. A numerical study illustrates the theory.

Submitted 22 December, 2020; originally announced December 2020.

arXiv:2012.10369

Upper and Lower Bounds on the Performance of Kernel PCA

Authors: Maxime Haddouche, Benjamin Guedj, John Shawe-Taylor

Abstract: Principal Component Analysis (PCA) is a popular method for dimension reduction and has attracted an unfailing interest for decades. More recently, kernel PCA (KPCA) has emerged as an extension of PCA but, despite its use in practice, a sound theoretical understanding of KPCA is missing. We contribute several lower and upper bounds on the efficiency of KPCA, involving the empirical eigenvalues of the kernel Gram matrix and new quantities involving a notion of variance. These bounds show how much information is captured by KPCA on average and contribute a better theoretical understanding of its efficiency. We demonstrate that fast convergence rates are achievable for a widely used class of kernels and we highlight the importance of some desirable properties of datasets to ensure KPCA efficiency.

Submitted 23 January, 2023; v1 submitted 18 December, 2020; originally announced December 2020.

Comments: 16 pages

arXiv:2006.07279

doi 10.3390/e23101330

PAC-Bayes unleashed: generalisation bounds with unbounded losses

Authors: Maxime Haddouche, Benjamin Guedj, Omar Rivasplata, John Shawe-Taylor

Abstract: We present new PAC-Bayesian generalisation bounds for learning problems with unbounded loss functions. This extends the relevance and applicability of the PAC-Bayes learning framework, where most of the existing literature focuses on supervised learning problems with a bounded loss function (typically assumed to take values in the interval [0;1]). In order to relax this assumption, we propose a new notion called HYPE (standing for \emph{HYPothesis-dependent rangE}), which effectively allows the range of the loss to depend on each predictor. Based on this new notion we derive a novel PAC-Bayesian generalisation bound for unbounded loss functions, and we instantiate it on a linear regression problem. To make our theory usable by the largest audience possible, we include discussions on actual computation, practicality and limitations of our assumptions.

Submitted 30 September, 2020; v1 submitted 12 June, 2020; originally announced June 2020.

Comments: 24 pages

Journal ref: Entropy 2021

arXiv:2006.00243

Scale matrix estimation under data-based loss in high and low dimensions

Authors: Mohamed Anis Haddouche, Dominique Fourdrinier, Fatiha Mezoued

Abstract: We consider the problem of estimating the scale matrix $Σ$ of the additive model $Y_{p\times n} = M + \mathcal{E}$, from a decision-theoretic point of view. Here, $p$ is the number of variables, $n$ is the number of observations, $M$ is a matrix of unknown parameters with rank $q<p$ and $\mathcal{E}$ is a random noise, whose distribution is elliptically symmetric with covariance matrix proportional to $I_n \otimes Σ$. We deal with a canonical form of this model where $Y$ is decomposed in two matrices, namely, $Z_{q\times p}$ which summarizes the information contained in $M$, and $U_{m\times p}$, where $m=n-q$, which summarizes the sufficient information to estimate $Σ$. As the natural estimators of the form ${\hat Σ}_a=a\, S$ (where $S=U^{T}\,U$ and $a$ is a positive constant) perform poorly when $p>m$ ($S$ non-invertible), we propose estimators of the form ${\hatΣ}_{a, G} = a\big( S+ S \, {S^{+}\,G(Z,S)}\big)$ where ${S^{+}}$ is the Moore-Penrose inverse of $S$ (which coincides with $S^{-1}$ when $S$ is invertible). We provide conditions on the correction matrix $SS^{+}{G(Z,S)}$ such that ${\hat Σ}_{a, G}$ improves over ${\hat Σ}_a$ under the data-based loss $L_S(Σ, \hat{Σ}) ={\rm tr} \big( S^{+}Σ\,({\hatΣ} \, Σ^{-1} - {I}_{p})^{2}\big)$. We adopt a unified approach of the two cases where $S$ is invertible ($p \leq m$) and $S$ is non-invertible ($p>m$).

Submitted 30 May, 2020; originally announced June 2020.

Showing 1–13 of 13 results for author: Haddouche, M