Showing 1–15 of 15 results for author: Martens, J

  1. arXiv:2302.10322  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation

    Authors: Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andrew Brock, Samuel L Smith, Yee Whye Teh

    Abstract: Skip connections and normalisation layers form two standard architectural components that are ubiquitous for the training of Deep Neural Networks (DNNs), but whose precise roles are poorly understood. Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them, using insights from wide NN kernel theory to improve signal propagation in vanilla DNNs (which…

    Submitted 20 February, 2023; originally announced February 2023.

    Comments: ICLR 2023

  2. arXiv:2206.00133  [pdf, other]

    cs.LG q-bio.BM stat.ML

    Pre-training via Denoising for Molecular Property Prediction

    Authors: Sheheryar Zaidi, Michael Schaarschmidt, James Martens, Hyunjik Kim, Yee Whye Teh, Alvaro Sanchez-Gonzalez, Peter Battaglia, Razvan Pascanu, Jonathan Godwin

    Abstract: Many important problems involving molecular property prediction from 3D structures have limited data, posing a generalization challenge for neural networks. In this paper, we describe a pre-training technique based on denoising that achieves a new state-of-the-art in molecular property prediction by utilizing large datasets of 3D molecular structures at equilibrium to learn meaningful representati…

    Submitted 24 October, 2022; v1 submitted 31 May, 2022; originally announced June 2022.

  3. arXiv:2203.08120  [pdf, other]

    cs.LG stat.ML

    Deep Learning without Shortcuts: Shaping the Kernel with Tailored Rectifiers

    Authors: Guodong Zhang, Aleksandar Botev, James Martens

    Abstract: Training very deep neural networks is still an extremely challenging task. The common solution is to use shortcut connections and normalization layers, which are both crucial ingredients in the popular ResNet architecture. However, there is strong evidence to suggest that ResNets behave more like ensembles of shallower networks than truly deep ones. Recently, it was shown that deep vanilla network…

    Submitted 15 March, 2022; originally announced March 2022.

    Comments: ICLR 2022

  4. arXiv:2111.06985  [pdf, other]

    stat.ME

    Bayesian Knockoff Generators for Robust Inference Under Complex Data Structure

    Authors: Michael J. Martens, Anjishnu Banerjee, Xinran Qi, Yushu Shi

    Abstract: The recent proliferation of medical data, such as genetics and electronic health records (EHR), offers new opportunities to find novel predictors of health outcomes. Presented with a large set of candidate features, interest often lies in selecting the ones most likely to be predictive of an outcome for further study such that the goal is to control the false discovery rate (FDR) at a specified le…

    Submitted 12 November, 2021; originally announced November 2021.

  5. arXiv:1907.04164  [pdf, other]

    cs.LG stat.ML

    Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model

    Authors: Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George E. Dahl, Christopher J. Shallue, Roger Grosse

    Abstract: Increasing the batch size is a popular way to speed up neural network training, but beyond some critical batch size, larger batch sizes yield diminishing returns. In this work, we study how the critical batch size changes based on properties of the optimization algorithm, including acceleration and preconditioning, through two different lenses: large scale experiments, and analysis of a simple noi…

    Submitted 28 October, 2019; v1 submitted 9 July, 2019; originally announced July 2019.

    Comments: NeurIPS 2019

  6. arXiv:1907.02610  [pdf, other]

    stat.ML cs.LG

    Adversarial Robustness through Local Linearization

    Authors: Chongli Qin, James Martens, Sven Gowal, Dilip Krishnan, Krishnamurthy Dvijotham, Alhussein Fawzi, Soham De, Robert Stanforth, Pushmeet Kohli

    Abstract: Adversarial training is an effective methodology for training deep neural networks that are robust against adversarial, norm-bounded perturbations. However, the computational cost of adversarial training grows prohibitively as the size of the model and number of input dimensions increase. Further, training against less expensive and therefore weaker adversaries produces models that are robust agai…

    Submitted 10 October, 2019; v1 submitted 4 July, 2019; originally announced July 2019.

  7. arXiv:1905.10961  [pdf, other]

    stat.ML cs.LG

    Fast Convergence of Natural Gradient Descent for Overparameterized Neural Networks

    Authors: Guodong Zhang, James Martens, Roger Grosse

    Abstract: Natural gradient descent has proven effective at mitigating the effects of pathological curvature in neural network optimization, but little is known theoretically about its convergence properties, especially for \emph{nonlinear} networks. In this work, we analyze for the first time the speed of convergence of natural gradient descent on nonlinear neural networks with squared-error loss. We identi…

    Submitted 28 October, 2019; v1 submitted 26 May, 2019; originally announced May 2019.

    Comments: NeurIPS 2019

  8. arXiv:1905.04926  [pdf, other]

    cs.LG cs.GT cs.MA cs.NE stat.ML

    Differentiable Game Mechanics

    Authors: Alistair Letcher, David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, Thore Graepel

    Abstract: Deep learning is built on the foundational guarantee that gradient descent on an objective function converges to local minima. Unfortunately, this guarantee fails in settings, such as generative adversarial nets, that exhibit multiple interacting losses. The behavior of gradient-based methods in games is not well understood -- and is becoming increasingly important as adversarial and multi-objecti…

    Submitted 13 May, 2019; originally announced May 2019.

    Comments: JMLR 2019, journal version of arXiv:1802.05642

    Journal ref: Journal of Machine Learning Research (JMLR), v20 (84) 1-40, 2019

  9. arXiv:1902.02405  [pdf, other]

    cs.LG stat.ML

    On the Variance of Unbiased Online Recurrent Optimization

    Authors: Tim Cooijmans, James Martens

    Abstract: The recently proposed Unbiased Online Recurrent Optimization algorithm (UORO, arXiv:1702.05043) uses an unbiased approximation of RTRL to achieve fully online gradient-based learning in RNNs. In this work we analyze the variance of the gradient estimate computed by UORO, and propose several possible changes to the method which reduce this variance both in theory and practice. We also contribute si…

    Submitted 6 February, 2019; originally announced February 2019.

  10. arXiv:1602.01407  [pdf, other]

    stat.ML cs.LG

    A Kronecker-factored approximate Fisher matrix for convolution layers

    Authors: Roger Grosse, James Martens

    Abstract: Second-order optimization methods such as natural gradient descent have the potential to speed up training of neural networks by correcting for the curvature of the loss function. Unfortunately, the exact natural gradient is impractical to compute for large models, and most approximations either require an expensive iterative procedure or make crude approximations to the curvature. We present Kron…

    Submitted 23 May, 2016; v1 submitted 3 February, 2016; originally announced February 2016.

  11. arXiv:1511.06807  [pdf, other]

    stat.ML cs.LG

    Adding Gradient Noise Improves Learning for Very Deep Networks

    Authors: Arvind Neelakantan, Luke Vilnis, Quoc V. Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, James Martens

    Abstract: Deep feedforward and recurrent networks have achieved impressive results in many perception and language processing applications. This success is partially attributed to architectural innovations such as convolutional and long short-term memory networks. The main motivation for these architectural innovations is that they capture better domain knowledge, and importantly are easier to optimize than…

    Submitted 20 November, 2015; originally announced November 2015.

  12. arXiv:1503.05671  [pdf, other]

    cs.LG cs.NE stat.ML

    Optimizing Neural Networks with Kronecker-factored Approximate Curvature

    Authors: James Martens, Roger Grosse

    Abstract: We propose an efficient method for approximating natural gradient descent in neural networks which we call Kronecker-Factored Approximate Curvature (K-FAC). K-FAC is based on an efficiently invertible approximation of a neural network's Fisher information matrix which is neither diagonal nor low-rank, and in some cases is completely non-sparse. It is derived by approximating various large blocks o…

    Submitted 7 June, 2020; v1 submitted 19 March, 2015; originally announced March 2015.

    Comments: Reduction ratio formula corrected. Removed incorrect claim about geodesics in footnote

  13. arXiv:1412.1193  [pdf, other]

    cs.LG stat.ML

    New insights and perspectives on the natural gradient method

    Authors: James Martens

    Abstract: Natural gradient descent is an optimization method traditionally motivated from the perspective of information geometry, and works well for many applications as an alternative to stochastic gradient descent. In this paper we critically analyze this method and its properties, and show how it can be viewed as a type of 2nd-order optimization method, with the Fisher information matrix acting as a sub…

    Submitted 19 September, 2020; v1 submitted 3 December, 2014; originally announced December 2014.

    Comments: Minor corrections from previous version and fixed typos. Official JMLR version

  14. arXiv:1411.7717  [pdf, ps, other]

    cs.LG stat.ML

    On the Expressive Efficiency of Sum Product Networks

    Authors: James Martens, Venkatesh Medabalimi

    Abstract: Sum Product Networks (SPNs) are a recently developed class of deep generative models which compute their associated unnormalized density functions using a special type of arithmetic circuit. When certain sufficient conditions, called the decomposability and completeness conditions (or "D&C" conditions), are imposed on the structure of these circuits, marginal densities and other useful quantities,…

    Submitted 22 January, 2015; v1 submitted 27 November, 2014; originally announced November 2014.

    Comments: Various minor revisions and corrections throughout

  15. arXiv:1206.6464  [pdf]

    cs.LG stat.ML

    Estimating the Hessian by Back-propagating Curvature

    Authors: James Martens, Ilya Sutskever, Kevin Swersky

    Abstract: In this work we develop Curvature Propagation (CP), a general technique for efficiently computing unbiased approximations of the Hessian of any function that is computed using a computational graph. At the cost of roughly two gradient evaluations, CP can give a rank-1 approximation of the whole Hessian, and can be repeatedly applied to give increasingly precise unbiased estimates of any or all of…

    Submitted 4 September, 2012; v1 submitted 27 June, 2012; originally announced June 2012.

    Comments: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012)