Deconstructing the Goldilocks Zone of Neural Network Initialization

Authors: Artem Vysogorets, Anna Dawid, Julia Kempe

Abstract: The second-order properties of the training loss have a massive impact on the optimization dynamics of deep learning models. Fort & Scherlis (2019) discovered that a large excess of positive curvature and local convexity of the loss Hessian is associated with highly trainable initial points located in a region coined the "Goldilocks zone". Only a handful of subsequent studies touched upon this relationship, so it remains largely unexplained. In this paper, we present a rigorous and comprehensive analysis of the Goldilocks zone for homogeneous neural networks. In particular, we derive the fundamental condition resulting in excess of positive curvature of the loss, explaining and refining its conventionally accepted connection to the initialization norm. Further, we relate the excess of positive curvature to model confidence, low initial loss, and a previously unknown type of vanishing cross-entropy loss gradient. To understand the importance of excessive positive curvature for trainability of deep networks, we optimize fully-connected and convolutional architectures outside the Goldilocks zone and analyze the emergent behaviors. We find that strong model performance is not perfectly aligned with the Goldilocks zone, calling for further research into this relationship.

Submitted 4 June, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

arXiv:2306.07104 [pdf, other]

Unveiling the Hessian's Connection to the Decision Boundary

Authors: Mahalakshmi Sabanayagam, Freya Behrens, Urte Adomaityte, Anna Dawid

Abstract: Understanding the properties of well-generalizing minima is at the heart of deep learning research. On the one hand, the generalization of neural networks has been connected to the decision boundary complexity, which is hard to study in the high-dimensional input space. Conversely, the flatness of a minimum has become a controversial proxy for generalization. In this work, we provide the missing link between the two approaches and show that the Hessian top eigenvectors characterize the decision boundary learned by the neural network. Notably, the number of outliers in the Hessian spectrum is proportional to the complexity of the decision boundary. Based on this finding, we provide a new and straightforward approach to studying the complexity of a high-dimensional decision boundary; show that this connection naturally inspires a new generalization measure; and finally, we develop a novel margin estimation technique which, in combination with the generalization measure, precisely identifies minima with simple wide-margin boundaries. Overall, this analysis establishes the connection between the Hessian and the decision boundary and provides a new method to identify minima with simple wide-margin decision boundaries.

Submitted 12 June, 2023; originally announced June 2023.

Comments: 14 pages, 6 figures + 18-page appendices with 19 figures. Any feedback is very welcome! Code is available at https://github.com/Shmoo137/Hessian-and-Decision-Boundary

arXiv:2306.02572 [pdf, other]

Introduction to Latent Variable Energy-Based Models: A Path Towards Autonomous Machine Intelligence

Authors: Anna Dawid, Yann LeCun

Abstract: Current automated systems have crucial limitations that need to be addressed before artificial intelligence can reach human-like levels and bring new technological revolutions. Among others, our societies still lack Level 5 self-driving cars, domestic robots, and virtual assistants that learn reliable world models, reason, and plan complex action sequences. In these notes, we summarize the main ideas behind the architecture of autonomous intelligence of the future proposed by Yann LeCun. In particular, we introduce energy-based and latent variable models and combine their advantages in the building block of LeCun's proposal, that is, in the hierarchical joint embedding predictive architecture (H-JEPA).

Submitted 4 June, 2023; originally announced June 2023.

Comments: 23 pages + 1-page appendix, 11 figures. These notes follow the content of three lectures given by Yann LeCun during the Les Houches Summer School on Statistical Physics and Machine Learning in 2022. Feedback and comments are most welcome!

arXiv:2005.07605 [pdf, ps, other]

On Learnability under General Stochastic Processes

Authors: A. Philip Dawid, Ambuj Tewari

Abstract: Statistical learning theory under independent and identically distributed (iid) sampling and online learning theory for worst case individual sequences are two of the best developed branches of learning theory. Statistical learning under general non-iid stochastic processes is less mature. We provide two natural notions of learnability of a function class under a general stochastic process. We show that both notions are in fact equivalent to online learnability. Our results hold for both binary classification and regression.

Submitted 11 March, 2022; v1 submitted 15 May, 2020; originally announced May 2020.

Comments: The regression results in the previous version have been made stronger

arXiv:1411.2636 [pdf, ps, other]

Bounding the Probability of Causation in Mediation Analysis

Authors: A. P. Dawid, R. Murtas, M. Musio

Abstract: Given empirical evidence for the dependence of an outcome variable on an exposure variable, we can typically only provide bounds for the "probability of causation" in the case of an individual who has developed the outcome after being exposed. We show how these bounds can be adapted or improved if further information becomes available. In addition to reviewing existing work on this topic, we provide a new analysis for the case where a mediating variable can be observed. In particular we show how the probability of causation can be bounded when there is no direct effect and no confounding. Keywords: Causal inference, Mediation Analysis, Probability of Causation

Submitted 10 November, 2014; originally announced November 2014.

Comments: 9 pages, 1 figure, 3 tables

MSC Class: 62A99

Journal ref: In Topics on Methodological and Applied Statistical Inference, edited by T. Di Battista, E. Moreno and W. Racugno. Springer (2016), 75-84

arXiv:1010.3425 [pdf, ps, other]

doi: 10.1214/10-SS081

Identifying the consequences of dynamic treatment strategies: A decision-theoretic overview

Authors: A. Philip Dawid, Vanessa Didelez

Abstract: We consider the problem of learning about and comparing the consequences of dynamic treatment strategies on the basis of observational data. We formulate this within a probabilistic decision-theoretic framework. Our approach is compared with related work by Robins and others: in particular, we show how Robins's 'G-computation' algorithm arises naturally from this decision-theoretic perspective. Careful attention is paid to the mathematical and substantive conditions required to justify the use of this formula. These conditions revolve around a property we term stability, which relates the probabilistic behaviours of observational and interventional regimes. We show how an assumption of 'sequential randomization' (or 'no unmeasured confounders'), or an alternative assumption of 'sequential irrelevance', can be used to infer stability. Probabilistic influence diagrams are used to simplify manipulations, and their power and limitations are discussed. We compare our approach with alternative formulations based on causal DAGs or potential response models. We aim to show that formulating the problem of assessing dynamic treatment strategies as a problem of decision analysis brings clarity, simplicity and generality.

Submitted 17 October, 2010; originally announced October 2010.

Comments: 49 pages, 15 figures

MSC Class: 62C05; 62A01

Journal ref: Statistics Surveys 2010, Vol. 4, 184-231

Showing 1–6 of 6 results for author: Dawid, A