Showing 1–24 of 24 results for author: Izmailov, P

  1. arXiv:2403.01272  [pdf, other]

    cs.LG stat.ML

    Can a Confident Prior Replace a Cold Posterior?

    Authors: Martin Marek, Brooks Paige, Pavel Izmailov

    Abstract: Benchmark datasets used for image classification tend to have very low levels of label noise. When Bayesian neural networks are trained on these datasets, they often underfit, misrepresenting the aleatoric uncertainty of the data. A common solution is to cool the posterior, which improves fit to the training data but is challenging to interpret from a Bayesian perspective. We explore whether poste…

    Submitted 2 March, 2024; originally announced March 2024.

  2. arXiv:2312.09390  [pdf, other]

    cs.CL

    Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

    Authors: Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, Jeff Wu

    Abstract: Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior - for example, to evaluate whether a model faithfully followed instructions or generated safe outputs. However, future superhuman models will behave in complex ways too difficult for humans to reliably evaluate; humans will only be able to weakly su…

    Submitted 14 December, 2023; originally announced December 2023.

  3. arXiv:2306.11074  [pdf, other]

    cs.LG stat.ML

    Simple and Fast Group Robustness by Automatic Feature Reweighting

    Authors: Shikai Qiu, Andres Potapczynski, Pavel Izmailov, Andrew Gordon Wilson

    Abstract: A major challenge to out-of-distribution generalization is reliance on spurious features -- patterns that are predictive of the class label in the training data distribution, but not causally related to the target. Standard methods for reducing the reliance on spurious features typically assume that we know what the spurious feature is, which is rarely true in the real world. Methods that attempt…

    Submitted 19 June, 2023; originally announced June 2023.

    Comments: ICML 23. Code available at https://github.com/AndPotap/afr

    Journal ref: 40th International Conference on Machine Learning 2023

  4. arXiv:2212.08013  [pdf, other]

    cs.CV cs.AI cs.LG

    FlexiViT: One Model for All Patch Sizes

    Authors: Lucas Beyer, Pavel Izmailov, Alexander Kolesnikov, Mathilde Caron, Simon Kornblith, Xiaohua Zhai, Matthias Minderer, Michael Tschannen, Ibrahim Alabdulmohsin, Filip Pavetic

    Abstract: Vision Transformers convert images to sequences by slicing them into patches. The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost, but changing the patch size typically requires retraining the model. In this paper, we demonstrate that simply randomizing the patch size at training time leads to a single set of w…

    Submitted 23 March, 2023; v1 submitted 15 December, 2022; originally announced December 2022.

    Comments: Code and pre-trained models available at https://github.com/google-research/big_vision. All authors made significant technical contributions. CVPR 2023

  5. arXiv:2210.11369  [pdf, other]

    cs.LG cs.CV stat.ML

    On Feature Learning in the Presence of Spurious Correlations

    Authors: Pavel Izmailov, Polina Kirichenko, Nate Gruver, Andrew Gordon Wilson

    Abstract: Deep classifiers are known to rely on spurious features – patterns which are correlated with the target on the training data but not inherently relevant to the learning problem, such as the image backgrounds when classifying the foregrounds. In this paper we evaluate the amount of information about the core (non-spurious) features that can be decoded from the representations learne…

    Submitted 20 October, 2022; originally announced October 2022.

    Comments: NeurIPS 2022. Code available at https://github.com/izmailovpavel/spurious_feature_learning

  6. arXiv:2204.02937  [pdf, other]

    cs.LG cs.CV stat.ML

    Last Layer Re-Training is Sufficient for Robustness to Spurious Correlations

    Authors: Polina Kirichenko, Pavel Izmailov, Andrew Gordon Wilson

    Abstract: Neural network classifiers can largely rely on simple spurious features, such as backgrounds, to make predictions. However, even in these cases, we show that they still often learn core features associated with the desired attributes of the data, contrary to recent findings. Inspired by this insight, we demonstrate that simple last layer retraining can match or outperform state-of-the-art approach…

    Submitted 30 June, 2023; v1 submitted 6 April, 2022; originally announced April 2022.

    Comments: ICLR 2023. Code is available at https://github.com/PolinaKirichenko/deep_feature_reweighting

  7. arXiv:2203.16481  [pdf, other]

    cs.LG stat.ML

    On Uncertainty, Tempering, and Data Augmentation in Bayesian Classification

    Authors: Sanyam Kapoor, Wesley J. Maddox, Pavel Izmailov, Andrew Gordon Wilson

    Abstract: Aleatoric uncertainty captures the inherent randomness of the data, such as measurement noise. In Bayesian regression, we often use a Gaussian observation model, where we control the level of aleatoric uncertainty with a noise variance parameter. By contrast, for Bayesian classification we use a categorical distribution with no mechanism to represent our beliefs about aleatoric uncertainty. Our wo…

    Submitted 30 March, 2022; originally announced March 2022.

  8. arXiv:2202.11678  [pdf, other]

    cs.LG stat.ML

    Bayesian Model Selection, the Marginal Likelihood, and Generalization

    Authors: Sanae Lotfi, Pavel Izmailov, Gregory Benton, Micah Goldblum, Andrew Gordon Wilson

    Abstract: How do we compare between hypotheses that are entirely consistent with observations? The marginal likelihood (aka Bayesian evidence), which represents the probability of generating our observations from a prior, provides a distinctive approach to this foundational question, automatically encoding Occam's razor. Although it has been observed that the marginal likelihood can overfit and is sensitive…

    Submitted 1 May, 2023; v1 submitted 23 February, 2022; originally announced February 2022.

    Comments: Extended version. Shorter ICML version available at arXiv:2202.11678v2

  9. arXiv:2106.11905  [pdf, other]

    cs.LG stat.ML

    Dangers of Bayesian Model Averaging under Covariate Shift

    Authors: Pavel Izmailov, Patrick Nicholson, Sanae Lotfi, Andrew Gordon Wilson

    Abstract: Approximate Bayesian inference for neural networks is considered a robust alternative to standard training, often providing good performance on out-of-distribution data. However, Bayesian neural networks (BNNs) with high-fidelity approximate inference via full-batch Hamiltonian Monte Carlo achieve poor generalization under covariate shift, even underperforming classical estimation. We explain this…

    Submitted 6 December, 2021; v1 submitted 22 June, 2021; originally announced June 2021.

    Comments: NeurIPS 2021. Code is available at https://github.com/izmailovpavel/bnn_covariate_shift

  10. arXiv:2106.05945  [pdf, other]

    cs.LG stat.ML

    Does Knowledge Distillation Really Work?

    Authors: Samuel Stanton, Pavel Izmailov, Polina Kirichenko, Alexander A. Alemi, Andrew Gordon Wilson

    Abstract: Knowledge distillation is a popular technique for training a small student network to emulate a larger teacher model, such as an ensemble of networks. We show that while knowledge distillation can improve student generalization, it does not typically work as it is commonly understood: there often remains a surprisingly large discrepancy between the predictive distributions of the teacher and the s…

    Submitted 6 December, 2021; v1 submitted 10 June, 2021; originally announced June 2021.

    Comments: NeurIPS 2021. Code available at https://github.com/samuelstanton/gnosis

  11. arXiv:2104.14421  [pdf, other]

    cs.LG stat.ML

    What Are Bayesian Neural Network Posteriors Really Like?

    Authors: Pavel Izmailov, Sharad Vikram, Matthew D. Hoffman, Andrew Gordon Wilson

    Abstract: The posterior over Bayesian neural network (BNN) parameters is extremely high-dimensional and non-convex. For computational reasons, researchers approximate this posterior using inexpensive mini-batch methods such as mean-field variational inference or stochastic-gradient Markov chain Monte Carlo (SGMCMC). To investigate foundational questions in Bayesian deep learning, we instead use full-batch H…

    Submitted 29 April, 2021; originally announced April 2021.

  12. arXiv:2010.11882  [pdf, other]

    cs.LG stat.ML

    Learning Invariances in Neural Networks

    Authors: Gregory Benton, Marc Finzi, Pavel Izmailov, Andrew Gordon Wilson

    Abstract: Invariances to translations have imbued convolutional neural networks with powerful generalization properties. However, we often do not know a priori what invariances are present in the data, or to what extent a model should be invariant to a given symmetry group. We show how to learn invariances and equivariances by parameterizing a distribution over augmentations and optimizing the traini…

    Submitted 1 December, 2020; v1 submitted 22 October, 2020; originally announced October 2020.

    Comments: NeurIPS 2020. Code available at https://github.com/g-benton/learning-invariances

  13. arXiv:2006.08545  [pdf, other]

    stat.ML cs.LG

    Why Normalizing Flows Fail to Detect Out-of-Distribution Data

    Authors: Polina Kirichenko, Pavel Izmailov, Andrew Gordon Wilson

    Abstract: Detecting out-of-distribution (OOD) data is crucial for robust machine learning systems. Normalizing flows are flexible deep generative models that often surprisingly fail to distinguish between in- and out-of-distribution data: a flow trained on pictures of clothing assigns higher likelihood to handwritten digits. We investigate why normalizing flows perform poorly for OOD detection. We demonstra…

    Submitted 15 June, 2020; originally announced June 2020.

    Comments: Code is available at https://github.com/PolinaKirichenko/flows_ood

  14. arXiv:2002.12880  [pdf, other]

    stat.ML cs.LG

    Generalizing Convolutional Neural Networks for Equivariance to Lie Groups on Arbitrary Continuous Data

    Authors: Marc Finzi, Samuel Stanton, Pavel Izmailov, Andrew Gordon Wilson

    Abstract: The translation equivariance of convolutional layers enables convolutional neural networks to generalize well on image problems. While translation equivariance provides a powerful inductive bias for images, we often additionally desire equivariance to other transformations, such as rotations, especially for non-image data. We propose a general method to construct a convolutional layer that is equi…

    Submitted 24 September, 2020; v1 submitted 25 February, 2020; originally announced February 2020.

    Comments: ICML 2020. Code available at https://github.com/mfinzi/LieConv

  15. arXiv:2002.08791  [pdf, other]

    cs.LG stat.ML

    Bayesian Deep Learning and a Probabilistic Perspective of Generalization

    Authors: Andrew Gordon Wilson, Pavel Izmailov

    Abstract: The key distinguishing property of a Bayesian approach is marginalization, rather than using a single setting of weights. Bayesian marginalization can particularly improve the accuracy and calibration of modern deep neural networks, which are typically underspecified by the data, and can represent many compelling but different solutions. We show that deep ensembles provide an effective mechanism f…

    Submitted 30 March, 2022; v1 submitted 20 February, 2020; originally announced February 2020.

    Comments: 31 pages, 19 figures

  16. arXiv:1912.13025  [pdf, other]

    cs.LG stat.ML

    Semi-Supervised Learning with Normalizing Flows

    Authors: Pavel Izmailov, Polina Kirichenko, Marc Finzi, Andrew Gordon Wilson

    Abstract: Normalizing flows transform a latent distribution through an invertible neural network for a flexible and pleasingly simple approach to generative modelling, while preserving an exact likelihood. We propose FlowGMM, an end-to-end approach to generative semi supervised learning with normalizing flows, using a latent Gaussian mixture model. FlowGMM is distinct in its simplicity, unified treatment of…

    Submitted 30 December, 2019; originally announced December 2019.

  17. arXiv:1907.07504  [pdf, other]

    cs.LG stat.ML

    Subspace Inference for Bayesian Deep Learning

    Authors: Pavel Izmailov, Wesley J. Maddox, Polina Kirichenko, Timur Garipov, Dmitry Vetrov, Andrew Gordon Wilson

    Abstract: Bayesian inference was once a gold standard for learning with neural networks, providing accurate full predictive distributions and well calibrated uncertainty. However, scaling Bayesian inference techniques to deep neural networks is challenging due to the high dimensionality of the parameter space. In this paper, we construct low-dimensional subspaces of parameter space, such as the first princi…

    Submitted 17 July, 2019; originally announced July 2019.

    Comments: Published at UAI 2019

  18. arXiv:1902.02476  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    A Simple Baseline for Bayesian Uncertainty in Deep Learning

    Authors: Wesley Maddox, Timur Garipov, Pavel Izmailov, Dmitry Vetrov, Andrew Gordon Wilson

    Abstract: We propose SWA-Gaussian (SWAG), a simple, scalable, and general purpose approach for uncertainty representation and calibration in deep learning. Stochastic Weight Averaging (SWA), which computes the first moment of stochastic gradient descent (SGD) iterates with a modified learning rate schedule, has recently been shown to improve generalization in deep learning. With SWAG, we fit a Gaussian usin…

    Submitted 31 December, 2019; v1 submitted 7 February, 2019; originally announced February 2019.

    Comments: Published at NeurIPS 2019

  19. arXiv:1806.05594  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    There Are Many Consistent Explanations of Unlabeled Data: Why You Should Average

    Authors: Ben Athiwaratkun, Marc Finzi, Pavel Izmailov, Andrew Gordon Wilson

    Abstract: Presently the most successful approaches to semi-supervised learning are based on consistency regularization, whereby a model is trained to be robust to small perturbations of its inputs and parameters. To understand consistency regularization, we conceptually explore how loss geometry interacts with training procedures. The consistency loss dramatically improves generalization performance over su…

    Submitted 21 February, 2019; v1 submitted 14 June, 2018; originally announced June 2018.

    Comments: Appears at ICLR 2019

  20. arXiv:1803.05407  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    Averaging Weights Leads to Wider Optima and Better Generalization

    Authors: Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, Andrew Gordon Wilson

    Abstract: Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training. We also show that this Stochastic Weight Averaging (SWA) proc…

    Submitted 25 February, 2019; v1 submitted 14 March, 2018; originally announced March 2018.

    Comments: Appears at the Conference on Uncertainty in Artificial Intelligence (UAI), 2018

  21. arXiv:1802.10026  [pdf, other]

    stat.ML cs.AI cs.LG

    Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs

    Authors: Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry Vetrov, Andrew Gordon Wilson

    Abstract: The loss functions of deep neural networks are complex and their geometric properties are not well understood. We show that the optima of these complex loss functions are in fact connected by simple curves over which training and test accuracy are nearly constant. We introduce a training procedure to discover these high-accuracy pathways between modes. Inspired by this new geometric insight, we al…

    Submitted 30 October, 2018; v1 submitted 27 February, 2018; originally announced February 2018.

    Comments: Appears at Advances in Neural Information Processing Systems (NIPS), 2018

  22. arXiv:1801.01928  [pdf, ps, other]

    cs.MS math.NA

    Tensor Train decomposition on TensorFlow (T3F)

    Authors: Alexander Novikov, Pavel Izmailov, Valentin Khrulkov, Michael Figurnov, Ivan Oseledets

    Abstract: Tensor Train decomposition is used across many branches of machine learning. We present T3F -- a library for Tensor Train decomposition based on TensorFlow. T3F supports GPU execution, batch processing, automatic differentiation, and versatile functionality for the Riemannian optimization framework, which takes into account the underlying manifold structure to construct efficient optimization meth…

    Submitted 2 March, 2020; v1 submitted 5 January, 2018; originally announced January 2018.

  23. arXiv:1710.07324  [pdf, other]

    cs.LG stat.ML

    Scalable Gaussian Processes with Billions of Inducing Inputs via Tensor Train Decomposition

    Authors: Pavel Izmailov, Alexander Novikov, Dmitry Kropotov

    Abstract: We propose a method (TT-GP) for approximate inference in Gaussian Process (GP) models. We build on previous scalable GP research including stochastic variational inference based on inducing inputs, kernel interpolation, and structure exploiting algebra. The key idea of our method is to use Tensor Train decomposition for variational parameters, which allows us to train GPs with billions of inducing…

    Submitted 17 January, 2018; v1 submitted 19 October, 2017; originally announced October 2017.

  24. arXiv:1611.06132  [pdf, ps, other]

    cs.LG cs.AI stat.ML

    Faster variational inducing input Gaussian process classification

    Authors: Pavel Izmailov, Dmitry Kropotov

    Abstract: Gaussian processes (GP) provide a prior over functions and allow finding complex regularities in data. Gaussian processes are successfully used for classification/regression problems and dimensionality reduction. In this work we consider the classification problem only. The complexity of standard methods for GP-classification scales cubically with the size of the training dataset. This complexity…

    Submitted 18 November, 2016; originally announced November 2016.