Search | arXiv e-print repository

Studying Generalization Through Data Averaging

Abstract: The generalization of machine learning models has a complex dependence on the data, model and learning algorithm. We study train and test performance, as well as the generalization gap given by the mean of their difference over different data set samples to understand their ``typical" behavior. We derive an expression for the gap as a function of the covariance between the model parameter distribu… ▽ More The generalization of machine learning models has a complex dependence on the data, model and learning algorithm. We study train and test performance, as well as the generalization gap given by the mean of their difference over different data set samples to understand their ``typical" behavior. We derive an expression for the gap as a function of the covariance between the model parameter distribution and the train loss, and another expression for the average test performance, showing test generalization only depends on data-averaged parameter distribution and the data-averaged loss. We show that for a large class of model parameter distributions a modified generalization gap is always non-negative. By specializing further to parameter distributions produced by stochastic gradient descent (SGD), along with a few approximations and modeling considerations, we are able to predict some aspects about how the generalization gap and model train and test performance vary as a function of SGD noise. We evaluate these predictions empirically on the Cifar10 classification task based on a ResNet architecture. △ Less

Submitted 27 June, 2022; originally announced June 2022.

arXiv:2108.09507 [pdf, other]

Shift-Curvature, SGD, and Generalization

Authors: Arwen V. Bradley, Carlos Alberto Gomez-Uribe, Manish Reddy Vuyyuru

Abstract: A longstanding debate surrounds the related hypotheses that low-curvature minima generalize better, and that SGD discourages curvature. We offer a more complete and nuanced view in support of both. First, we show that curvature harms test performance through two new mechanisms, the shift-curvature and bias-curvature, in addition to a known parameter-covariance mechanism. The three curvature-mediat… ▽ More A longstanding debate surrounds the related hypotheses that low-curvature minima generalize better, and that SGD discourages curvature. We offer a more complete and nuanced view in support of both. First, we show that curvature harms test performance through two new mechanisms, the shift-curvature and bias-curvature, in addition to a known parameter-covariance mechanism. The three curvature-mediated contributions to test performance are reparametrization-invariant although curvature is not. The shift in the shift-curvature is the line connecting train and test local minima, which differ due to dataset sampling or distribution shift. Although the shift is unknown at training time, the shift-curvature can still be mitigated by minimizing overall curvature. Second, we derive a new, explicit SGD steady-state distribution showing that SGD optimizes an effective potential related to but different from train loss, and that SGD noise mediates a trade-off between deep versus low-curvature regions of this effective potential. Third, combining our test performance analysis with the SGD steady state shows that for small SGD noise, the shift-curvature may be the most significant of the three mechanisms. Our experiments confirm the impact of shift-curvature on test loss, and further explore the relationship between SGD noise and curvature. △ Less

Submitted 27 July, 2022; v1 submitted 21 August, 2021; originally announced August 2021.

arXiv:1806.09976 [pdf, other]

The decoupled extended Kalman filter for dynamic exponential-family factorization models

Authors: Carlos Alberto Gomez-Uribe, Brian Karrer

Abstract: Motivated by the needs of online large-scale recommender systems, we specialize the decoupled extended Kalman filter (DEKF) to factorization models, including factorization machines, matrix and tensor factorization, and illustrate the effectiveness of the approach through numerical experiments on synthetic and on real-world data. Online learning of model parameters through the DEKF makes factoriza… ▽ More Motivated by the needs of online large-scale recommender systems, we specialize the decoupled extended Kalman filter (DEKF) to factorization models, including factorization machines, matrix and tensor factorization, and illustrate the effectiveness of the approach through numerical experiments on synthetic and on real-world data. Online learning of model parameters through the DEKF makes factorization models more broadly useful by (i) allowing for more flexible observations through the entire exponential family, (ii) modeling parameter drift, and (iii) producing parameter uncertainty estimates that can enable explore/exploit and other applications. We use a different parameter dynamics than the standard DEKF, allowing parameter drift while encouraging reasonable values. We also present an alternate derivation of the extended Kalman filter and DEKF that highlights the role of the Fisher information matrix in the EKF. △ Less

Submitted 24 February, 2021; v1 submitted 26 June, 2018; originally announced June 2018.

Comments: 29 pages, 4 figures

Journal ref: Journal of Machine Learning Research (JMLR), 22(5):1-25, 2021

Showing 1–3 of 3 results for author: Gomez-Uribe, C A