-
Learning Algorithm Generalization Error Bounds via Auxiliary Distributions
Authors:
Gholamali Aminian,
Saeed Masiha,
Laura Toni,
Miguel R. D. Rodrigues
Abstract:
Generalization error bounds are essential for understanding how well machine learning models work. In this work, we propose a novel method, the Auxiliary Distribution Method, that leads to new upper bounds on the expected generalization error in supervised learning scenarios. We show that our general upper bounds can be specialized under some conditions to new bounds involving the $α$-Jensen-Shannon and $α$-Rényi ($0< α< 1$) information between a random variable modeling the set of training samples and another random variable modeling the set of hypotheses. Our upper bounds based on $α$-Jensen-Shannon information are also finite. Additionally, we demonstrate how our auxiliary distribution method can be used to derive upper bounds on the excess risk of some learning algorithms in the supervised learning context, and on the generalization error under the distribution mismatch scenario in supervised learning algorithms, where the distribution mismatch is modeled as the $α$-Jensen-Shannon or $α$-Rényi divergence between the distributions of test and training data samples. We also outline the conditions under which our proposed upper bounds might be tighter than earlier upper bounds.
Submitted 16 April, 2024; v1 submitted 2 October, 2022;
originally announced October 2022.
-
f-divergences and their applications in lossy compression and bounding generalization error
Authors:
Saeed Masiha,
Amin Gohari,
Mohammad Hossein Yassaee
Abstract:
In this paper, we provide three applications for $f$-divergences: (i) we introduce Sanov's upper bound on the tail probability of the sum of independent random variables based on super-modular $f$-divergence and show that our generalized Sanov's bound strictly improves over the ordinary one; (ii) we consider the lossy compression problem, which studies the set of achievable rates for a given distortion and code length. We extend the rate-distortion function using mutual $f$-information and provide new and strictly better bounds on achievable rates in the finite blocklength regime using super-modular $f$-divergences; and (iii) we provide a connection between the generalization error of algorithms with bounded input/output mutual $f$-information and a generalized rate-distortion problem. This connection allows us to bound the generalization error of learning algorithms using lower bounds on the $f$-rate-distortion function. Our bound is based on a new lower bound on the rate-distortion function that (for some examples) strictly improves over previously best-known bounds.
Submitted 26 January, 2023; v1 submitted 21 June, 2022;
originally announced June 2022.
-
Stochastic Second-Order Methods Improve Best-Known Sample Complexity of SGD for Gradient-Dominated Function
Authors:
Saeed Masiha,
Saber Salehkaleybar,
Niao He,
Negar Kiyavash,
Patrick Thiran
Abstract:
We study the performance of Stochastic Cubic Regularized Newton (SCRN) on a class of functions satisfying the gradient dominance property with $1\le α\le 2$, which holds in a wide range of applications in machine learning and signal processing. This condition ensures that any first-order stationary point is a global optimum. We prove that the total sample complexity of SCRN in achieving an $ε$-global optimum is $\mathcal{O}(ε^{-7/(2α)+1})$ for $1\le α< 3/2$ and $\tilde{\mathcal{O}}(ε^{-2/α})$ for $3/2\le α\le 2$. SCRN improves the best-known sample complexity of stochastic gradient descent. Even under a weak version of the gradient dominance property, which is applicable to policy-based reinforcement learning (RL), SCRN achieves the same improvement over stochastic policy gradient methods. Additionally, we show that the average sample complexity of SCRN can be reduced to $\mathcal{O}(ε^{-2})$ for $α=1$ using a variance reduction method with time-varying batch sizes. Experimental results in various RL settings showcase the remarkable performance of SCRN compared to first-order methods.
Submitted 20 January, 2023; v1 submitted 25 May, 2022;
originally announced May 2022.
-
Learning under Distribution Mismatch and Model Misspecification
Authors:
Saeed Masiha,
Amin Gohari,
Mohammad Hossein Yassaee,
Mohammad Reza Aref
Abstract:
We study learning algorithms when there is a mismatch between the distributions of the training and test datasets of a learning algorithm. The effect of this mismatch on the generalization error and model misspecification are quantified. Moreover, we provide a connection between the generalization error and rate-distortion theory, which allows one to utilize bounds from rate-distortion theory to derive new bounds on the generalization error and vice versa. In particular, the rate-distortion based bound strictly improves over the earlier bound by Xu and Raginsky even when there is no mismatch. We also discuss how "auxiliary loss functions" can be utilized to obtain upper bounds on the generalization error.
Submitted 10 August, 2022; v1 submitted 10 February, 2021;
originally announced February 2021.