
Stable Nonconvex-Nonconcave Training via Linear Interpolation

Authors: Thomas Pethick, Wanyun Xie, Volkan Cevher

Abstract: This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training. We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators. We construct a new optimization scheme called relaxed approximate proximal point (RAPP), which is the first explicit method without anchoring to achieve last iterate convergence rates for $ρ$-comonotone problems while only requiring $ρ > -\tfrac{1}{2L}$. The construction extends to constrained and regularized settings. By replacing the inner optimizer in RAPP we rediscover the family of Lookahead algorithms for which we establish convergence in cohypomonotone problems even when the base optimizer is taken to be gradient descent ascent. The range of cohypomonotone problems in which Lookahead converges is further expanded by exploiting that Lookahead inherits the properties of the base optimizer. We corroborate the results with experiments on generative adversarial networks which demonstrate the benefits of the linear interpolation present in both RAPP and Lookahead.

Submitted 14 March, 2024; v1 submitted 20 October, 2023; originally announced October 2023.
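
The abstract of "Stable Nonconvex-Nonconcave Training via Linear Interpolation" describes Lookahead as linear interpolation wrapped around a base optimizer such as gradient descent ascent (GDA). The sketch below is a toy illustration of that idea, not the authors' code: the bilinear game f(x, y) = x * y, the stepsize, the number of inner steps, and the interpolation weight are all assumptions chosen for the example, and every function name is hypothetical.

```python
import math


def gda_step(x, y, eta):
    # Simultaneous gradient descent-ascent on the toy game f(x, y) = x * y:
    # x descends along df/dx = y, y ascends along df/dy = x.
    return x - eta * y, y + eta * x


def run_gda(x, y, eta, steps):
    # Plain GDA, used both as a baseline and as the inner (base) optimizer.
    for _ in range(steps):
        x, y = gda_step(x, y, eta)
    return x, y


def run_lookahead(x, y, eta, inner_steps, alpha, outer_steps):
    # Lookahead-style loop: run the base optimizer for a few steps, then
    # linearly interpolate between the starting point and the result.
    for _ in range(outer_steps):
        fast_x, fast_y = run_gda(x, y, eta, inner_steps)
        x = (1 - alpha) * x + alpha * fast_x
        y = (1 - alpha) * y + alpha * fast_y
    return x, y


# Assumed toy settings: both runs use 500 gradient evaluations in total.
gda_norm = math.hypot(*run_gda(1.0, 1.0, eta=0.3, steps=500))
lookahead_norm = math.hypot(
    *run_lookahead(1.0, 1.0, eta=0.3, inner_steps=5, alpha=0.5, outer_steps=100)
)
print(f"GDA distance to the solution (0, 0):       {gda_norm:.3e}")
print(f"Lookahead distance to the solution (0, 0): {lookahead_norm:.3e}")
```

On this toy problem each GDA step scales the distance to the origin by sqrt(1 + eta^2), so plain GDA spirals outward, while the interpolated iterate contracts for the parameters assumed here. The bilinear game is monotone, so it is only a simple special case of the cohypomonotone setting the paper analyzes.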

arXiv:2306.05325 [pdf, other]

Federated Learning under Covariate Shifts with Generalization Guarantees

Authors: Ali Ramezani-Kebrya, Fanghui Liu, Thomas Pethick, Grigorios Chrysos, Volkan Cevher

Abstract: This paper addresses intra-client and inter-client covariate shifts in federated learning (FL) with a focus on the overall generalization performance. To handle covariate shifts, we formulate a new global model training paradigm and propose Federated Importance-Weighted Empirical Risk Minimization (FTW-ERM) along with improving density ratio matching methods without requiring perfect knowledge of the supremum over true ratios. We also propose the communication-efficient variant FITW-ERM with the same level of privacy guarantees as those of classical ERM in FL. We theoretically show that FTW-ERM achieves smaller generalization error than classical ERM under certain settings. Experimental results demonstrate the superiority of FTW-ERM over existing FL baselines in challenging imbalanced federated settings in terms of data distribution shifts across clients.

Submitted 8 June, 2023; originally announced June 2023.

Comments: Published in Transactions on Machine Learning Research (TMLR)

arXiv:2302.09831 [pdf, other]

Escaping limit cycles: Global convergence for constrained nonconvex-nonconcave minimax problems

Authors: Thomas Pethick, Puya Latafat, Panagiotis Patrinos, Olivier Fercoq, Volkan Cevher

Abstract: This paper introduces a new extragradient-type algorithm for a class of nonconvex-nonconcave minimax problems. It is well-known that finding a local solution for general minimax problems is computationally intractable. This observation has recently motivated the study of structures sufficient for convergence of first order methods in the more general setting of variational inequalities when the so-called weak Minty variational inequality (MVI) holds. This problem class captures non-trivial structures as we demonstrate with examples, for which a large family of existing algorithms provably converge to limit cycles. Our results require a less restrictive parameter range in the weak MVI compared to what is previously known, thus extending the applicability of our scheme. The proposed algorithm is applicable to constrained and regularized problems, and involves an adaptive stepsize allowing for potentially larger stepsizes. Our scheme also converges globally even in settings where the underlying operator exhibits limit cycles.

Submitted 20 February, 2023; originally announced February 2023.

Comments: Code accessible at: https://github.com/LIONS-EPFL/weak-minty-code/
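
For readers unfamiliar with the extragradient template that the "Escaping limit cycles" abstract builds on, the following is a minimal sketch of the classical, fixed-stepsize extragradient update. It is not the adaptive scheme proposed in the paper and is not taken from the linked repository: the toy game f(x, y) = x * y, the stepsize, and all function names are assumptions made for illustration.

```python
import math


def operator(x, y):
    # Operator of the toy min-max game f(x, y) = x * y:
    # F(x, y) = (df/dx, -df/dy) = (y, -x). Its only zero is (0, 0).
    return y, -x


def extragradient_step(x, y, gamma):
    # Extrapolation: take a trial step from the current point.
    gx, gy = operator(x, y)
    half_x, half_y = x - gamma * gx, y - gamma * gy
    # Update: step from the current point using the operator
    # evaluated at the extrapolated point.
    gx, gy = operator(half_x, half_y)
    return x - gamma * gx, y - gamma * gy


def run_extragradient(x, y, gamma, steps):
    for _ in range(steps):
        x, y = extragradient_step(x, y, gamma)
    return x, y


# Assumed toy settings: fixed stepsize 0.3, 500 iterations.
eg_norm = math.hypot(*run_extragradient(1.0, 1.0, gamma=0.3, steps=500))
print(f"Extragradient distance to the solution (0, 0): {eg_norm:.3e}")
```

Each iteration costs two operator evaluations. On this monotone toy problem the iterates contract toward the origin for any fixed stepsize below 1. The weak MVI setting studied in the paper is harder, and this sketch makes no claim about it.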

arXiv:2302.09029 [pdf, other]

Solving stochastic weak Minty variational inequalities without increasing batch size

Authors: Thomas Pethick, Olivier Fercoq, Puya Latafat, Panagiotis Patrinos, Volkan Cevher

Abstract: This paper introduces a family of stochastic extragradient-type algorithms for a class of nonconvex-nonconcave problems characterized by the weak Minty variational inequality (MVI). Unlike existing results on extragradient methods in the monotone setting, employing diminishing stepsizes is no longer possible in the weak MVI setting. This has led to approaches such as increasing batch sizes per iteration, which can however be prohibitively expensive. In contrast, our proposed methods involve two stepsizes and only require one additional oracle evaluation per iteration. We show that it is possible to keep one stepsize fixed while only the second stepsize is taken to be diminishing, making it interesting even in the monotone setting. Almost sure convergence is established and we provide a unified analysis for this family of schemes which contains a nonlinear generalization of the celebrated primal dual hybrid gradient algorithm.

Submitted 17 February, 2023; originally announced February 2023.

Comments: Code accessible at: https://github.com/LIONS-EPFL/stochastic-weak-minty-code

arXiv:2302.08872 [pdf, other]

Revisiting adversarial training for the worst-performing class

Authors: Thomas Pethick, Grigorios G. Chrysos, Volkan Cevher

Abstract: Despite progress in adversarial training (AT), there is a substantial gap between the top-performing and worst-performing classes in many datasets. For example, on CIFAR10, the accuracies for the best and worst classes are 74% and 23%, respectively. We argue that this gap can be reduced by explicitly optimizing for the worst-performing class, resulting in a min-max-max optimization formulation. Our method, called class focused online learning (CFOL), includes high probability convergence guarantees for the worst class loss and can be easily integrated into existing training setups with minimal computational overhead. We demonstrate an improvement to 32% in the worst class accuracy on CIFAR10, and we observe consistent behavior across CIFAR100 and STL10. Our study highlights the importance of moving beyond average accuracy, which is particularly important in safety-critical applications.

Submitted 17 February, 2023; originally announced February 2023.

Comments: Code accessible at: https://github.com/LIONS-EPFL/class-focused-online-learning-code

arXiv:2111.01875 [pdf, other]

Subquadratic Overparameterization for Shallow Neural Networks

Authors: Chaehwan Song, Ali Ramezani-Kebrya, Thomas Pethick, Armin Eftekhari, Volkan Cevher

Abstract: Overparameterization refers to the important phenomenon where the width of a neural network is chosen such that learning algorithms can provably attain zero loss in nonconvex training. The existing theory establishes such global convergence using various initialization strategies, training modifications, and width scalings. In particular, the state-of-the-art results require the width to scale quadratically with the number of training data under standard initialization strategies used in practice for best generalization performance. In contrast, the most recent results obtain linear scaling either by requiring initializations that lead to "lazy training", or by training only a single layer. In this work, we provide an analytical framework that allows us to adopt standard initialization strategies, possibly avoid lazy training, and train all layers simultaneously in basic shallow neural networks while attaining a desirable subquadratic scaling on the network width. We achieve the desiderata via the Polyak-Lojasiewicz condition, smoothness, and standard assumptions on data, and use tools from random matrix theory.

Submitted 2 November, 2021; originally announced November 2021.

Comments: To appear at the conference on Neural Information Processing Systems (NeurIPS 2021)

Showing 1–6 of 6 results for author: Pethick, T