-
PETRA: Parallel End-to-end Training with Reversible Architectures
Authors:
Stéphane Rivaud,
Louis Fournier,
Thomas Pumir,
Eugene Belilovsky,
Michael Eickenberg,
Edouard Oyallon
Abstract:
Reversible architectures have been shown to perform on par with their non-reversible counterparts, and are applied in deep learning for memory savings and generative modeling. In this work, we show how reversible architectures can solve challenges in parallelizing deep model training. We introduce PETRA, a novel alternative to backpropagation for parallelizing gradient computations. PETRA facilitates effective model parallelism by enabling stages (i.e., a set of layers) to compute independently on different devices, while only needing to communicate activations and gradients between each other. By decoupling the forward and backward passes and keeping a single updated version of the parameters, the need for weight stashing is also removed. We develop a custom autograd-like training framework for PETRA, and we demonstrate its effectiveness on CIFAR-10, ImageNet32, and ImageNet, achieving accuracies comparable to backpropagation using ResNet-18, ResNet-34, and ResNet-50 models.
Submitted 4 June, 2024;
originally announced June 2024.
-
Can Forward Gradient Match Backpropagation?
Authors:
Louis Fournier,
Stéphane Rivaud,
Eugene Belilovsky,
Michael Eickenberg,
Edouard Oyallon
Abstract:
Forward Gradients - the idea of using directional derivatives in forward differentiation mode - have recently been shown to be utilizable for neural network training while avoiding problems generally associated with backpropagation gradient computation, such as locking and memorization requirements. The cost is the requirement to guess the step direction, which is hard in high dimensions. While current solutions rely on weighted averages over isotropic guess vector distributions, we propose to strongly bias our gradient guesses in directions that are much more promising, such as feedback obtained from small, local auxiliary networks. For a standard computer vision neural network, we conduct a rigorous study systematically covering a variety of combinations of gradient targets and gradient guesses, including those previously presented in the literature. We find that using gradients obtained from a local loss as a candidate direction drastically improves on random noise in Forward Gradient methods.
Submitted 12 June, 2023;
originally announced June 2023.
-
Sampling Markov Models under Constraints: Complexity Results for Binary Equalities and Grammar Membership
Authors:
Stephane Rivaud,
François Pachet
Abstract:
We aim to enforce hard constraints that impose a global structure on sequences generated from Markov models. In this report, we study the complexity of sampling Markov sequences under two classes of constraints: Binary Equalities and Grammar Membership Constraints. First, we give a proof sketch of #P-completeness for binary equalities and identify three sub-cases where sampling is polynomial. We then give a proof of #P-completeness for grammar membership, and identify two cases where sampling is tractable. The first tractable case is when the grammar is proven to be unambiguous. Our main contribution is to identify a new, broader class of grammars for which sampling is tractable. We provide algorithms, along with their time and space complexity, for all the polynomial cases we have identified.
Submitted 28 November, 2017;
originally announced November 2017.