Search | arXiv e-print repository

arXiv:2406.02052

PETRA: Parallel End-to-end Training with Reversible Architectures

Authors: Stéphane Rivaud, Louis Fournier, Thomas Pumir, Eugene Belilovsky, Michael Eickenberg, Edouard Oyallon

Abstract: Reversible architectures have been shown to perform on par with their non-reversible counterparts, and have been applied in deep learning for memory savings and generative modeling. In this work, we show how reversible architectures can solve challenges in parallelizing deep model training. We introduce PETRA, a novel alternative to backpropagation for parallelizing gradient computations. PETRA facilitates effective model parallelism by enabling stages (i.e., a set of layers) to compute independently on different devices, while only needing to communicate activations and gradients between each other. By decoupling the forward and backward passes and keeping a single updated version of the parameters, the need for weight stashing is also removed. We develop a custom autograd-like training framework for PETRA, and we demonstrate its effectiveness on CIFAR-10, ImageNet32, and ImageNet, achieving accuracies comparable to backpropagation using ResNet-18, ResNet-34, and ResNet-50 models.

Submitted 4 June, 2024; originally announced June 2024.

arXiv:2011.03358

Generalization of Quasi-Newton Methods: Application to Robust Symmetric Multisecant Updates

Authors: Damien Scieur, Lewis Liu, Thomas Pumir, Nicolas Boumal

Abstract: Quasi-Newton techniques approximate the Newton step by estimating the Hessian using the so-called secant equations. Some of these methods compute the Hessian using several secant equations but produce non-symmetric updates. Other quasi-Newton schemes, such as BFGS, enforce symmetry but cannot satisfy more than one secant equation. We propose a new type of quasi-Newton symmetric update using several secant equations in a least-squares sense. Our approach generalizes and unifies the design of quasi-Newton updates and satisfies provable robustness guarantees.

Submitted 8 February, 2021; v1 submitted 6 November, 2020; originally announced November 2020.

Comments: AISTATS 2021

arXiv:1907.01145

doi 10.1093/imaiai/iaaa035

The generalized orthogonal Procrustes problem in the high noise regime

Authors: Thomas Pumir, Amit Singer, Nicolas Boumal

Abstract: We consider the problem of estimating a cloud of points from numerous noisy observations of that cloud after unknown rotations, and possibly reflections. This is an instance of the general problem of estimation under group action, originally inspired by applications in 3-D imaging and computer vision. We focus on a regime where the noise level is larger than the magnitude of the signal, so much so that the rotations cannot be estimated reliably. We propose a simple and efficient procedure based on invariant polynomials (effectively: the Gram matrices) to recover the signal, and we assess it against fundamental limits of the problem that we derive. We show our approach adapts to the noise level and is statistically optimal (up to constants) for both the low and high noise regimes. In studying the variance of our estimator, we encounter the question of the sensitivity of a type of thin Cholesky factorization, for which we provide an improved bound which may be of independent interest.

Submitted 23 May, 2021; v1 submitted 1 July, 2019; originally announced July 2019.

MSC Class: 34K30; 35K57; 35Q80; 92D25

Journal ref: Information and Inference: A Journal of the IMA, iaaa035, 2021

arXiv:1806.03763

Smoothed analysis of the low-rank approach for smooth semidefinite programs

Authors: Thomas Pumir, Samy Jelassi, Nicolas Boumal

Abstract: We consider semidefinite programs (SDPs) of size $n$ with equality constraints. In order to overcome scalability issues, Burer and Monteiro proposed a factorized approach based on optimizing over a matrix $Y$ of size $n$ by $k$ such that $X = YY^*$ is the SDP variable. The advantages of such a formulation are twofold: the dimension of the optimization variable is reduced and positive semidefiniteness is naturally enforced. However, the problem in $Y$ is non-convex. In prior work, it has been shown that, when the constraints on the factorized variable regularly define a smooth manifold, provided $k$ is large enough, for almost all cost matrices, all second-order stationary points (SOSPs) are optimal. Importantly, in practice, one can only compute points which approximately satisfy necessary optimality conditions, leading to the question: are such points also approximately optimal? To this end, and under similar assumptions, we use smoothed analysis to show that approximate SOSPs for a randomly perturbed objective function are approximate global optima, with $k$ scaling like the square root of the number of constraints (up to log factors). Moreover, we bound the optimality gap at the approximate solution of the perturbed problem with respect to the original problem. We particularize our results to an SDP relaxation of phase retrieval.

Submitted 27 November, 2018; v1 submitted 10 June, 2018; originally announced June 2018.

Showing 1–4 of 4 results for author: Pumir, T