Search | arXiv e-print repository

Convergence of ease-controlled Random Reshuffling gradient Algorithms under Lipschitz smoothness

Authors: Ruggiero Seccia, Corrado Coppola, Giampaolo Liuzzi, Laura Palagi

Abstract: In this work, we consider minimizing the average of a very large number of smooth and possibly non-convex functions, and we focus on two widely used minibatch frameworks to tackle this optimization problem: Incremental Gradient (IG) and Random Reshuffling (RR). We define ease-controlled modifications of the IG/RR schemes, which require a light additional computational effort {but} can be proved to… ▽ More In this work, we consider minimizing the average of a very large number of smooth and possibly non-convex functions, and we focus on two widely used minibatch frameworks to tackle this optimization problem: Incremental Gradient (IG) and Random Reshuffling (RR). We define ease-controlled modifications of the IG/RR schemes, which require a light additional computational effort {but} can be proved to converge under {weak} and standard assumptions. In particular, we define two algorithmic schemes in which the IG/RR iteration is controlled by using a watchdog rule and a derivative-free linesearch that activates only sporadically to guarantee convergence. The two schemes differ in the watchdog and the linesearch, which are performed using either a monotonic or a non-monotonic rule. The two schemes also allow controlling the updating of the stepsize used in the main IG/RR iteration, avoiding the use of pre-set rules that may drive the stepsize to zero too fast, reducing the effort in designing effective updating rules of the stepsize. We prove convergence under the mild assumption of Lipschitz continuity of the gradients of the component functions and perform extensive computational analysis using different deep neural architectures and a benchmark of varying-size datasets. We compare our implementation with both a full batch gradient method (i.e. L-BFGS) and an implementation of IG/RR methods, proving that our algorithms require a similar computational effort compared to the other online algorithms and that the control on the learning rate may allow a faster decrease of the objective function. △ Less

Submitted 20 May, 2024; v1 submitted 4 December, 2022; originally announced December 2022.

Comments: Add author, add references, correct typos, improve imoplementation

MSC Class: 90.C.XX ACM Class: G.4.1

arXiv:2003.08123 [pdf, other]

doi 10.1007/s10898-019-00856-0

Block Layer Decomposition schemes for training Deep Neural Networks

Authors: Laura Palagi, Ruggiero Seccia

Abstract: Deep Feedforward Neural Networks' (DFNNs) weights estimation relies on the solution of a very large nonconvex optimization problem that may have many local (no global) minimizers, saddle points and large plateaus. As a consequence, optimization algorithms can be attracted toward local minimizers which can lead to bad solutions or can slow down the optimization process. Furthermore, the time needed… ▽ More Deep Feedforward Neural Networks' (DFNNs) weights estimation relies on the solution of a very large nonconvex optimization problem that may have many local (no global) minimizers, saddle points and large plateaus. As a consequence, optimization algorithms can be attracted toward local minimizers which can lead to bad solutions or can slow down the optimization process. Furthermore, the time needed to find good solutions to the training problem depends on both the number of samples and the number of variables. In this work, we show how Block Coordinate Descent (BCD) methods can be applied to improve performance of state-of-the-art algorithms by avoiding bad stationary points and flat regions. We first describe a batch BCD method ables to effectively tackle the network's depth and then we further extend the algorithm proposing a \textit{minibatch} BCD framework able to scale with respect to both the number of variables and the number of samples by embedding a BCD approach into a minibatch framework. By extensive numerical results on standard datasets for several architecture networks, we show how the application of BCD methods to the training phase of DFNNs permits to outperform standard batch and minibatch algorithms leading to an improvement on both the training phase and the generalization performance of the networks. △ Less

Submitted 18 March, 2020; originally announced March 2020.

Comments: 23 pages. J Glob Optim (2019)

arXiv:1906.06812 [pdf, other]

doi 10.1007/978-3-030-21803-4_72

A gray-box approach for curriculum learning

Authors: Francesco Foglino, Matteo Leonetti, Simone Sagratella, Ruggiero Seccia

Abstract: Curriculum learning is often employed in deep reinforcement learning to let the agent progress more quickly towards better behaviors. Numerical methods for curriculum learning in the literature provides only initial heuristic solutions, with little to no guarantee on their quality. We define a new gray-box function that, including a suitable scheduling problem, can be effectively used to reformula… ▽ More Curriculum learning is often employed in deep reinforcement learning to let the agent progress more quickly towards better behaviors. Numerical methods for curriculum learning in the literature provides only initial heuristic solutions, with little to no guarantee on their quality. We define a new gray-box function that, including a suitable scheduling problem, can be effectively used to reformulate the curriculum learning problem. We propose different efficient numerical methods to address this gray-box reformulation. Preliminary numerical results on a benchmark task in the curriculum learning literature show the viability of the proposed approach. △ Less

Submitted 16 June, 2019; originally announced June 2019.

Comments: 10 pages, 1 figure

Journal ref: Optimization of Complex Systems: Theory, Models, Algorithms and Applications, 2020, pp 720-729

Showing 1–3 of 3 results for author: Seccia, R