Search | arXiv e-print repository

Pruning is Optimal for Learning Sparse Features in High-Dimensions

Authors: Nuri Mert Vural, Murat A. Erdogdu

Abstract: While it is commonly observed in practice that pruning networks to a certain level of sparsity can improve the quality of the features, a theoretical explanation of this phenomenon remains elusive. In this work, we investigate this by demonstrating that a broad class of statistical models can be optimally learned using pruned neural networks trained with gradient descent, in high-dimensions. We… ▽ More While it is commonly observed in practice that pruning networks to a certain level of sparsity can improve the quality of the features, a theoretical explanation of this phenomenon remains elusive. In this work, we investigate this by demonstrating that a broad class of statistical models can be optimally learned using pruned neural networks trained with gradient descent, in high-dimensions. We consider learning both single-index and multi-index models of the form $y = σ^*(\boldsymbol{V}^{\top} \boldsymbol{x}) + ε$, where $σ^*$ is a degree-$p$ polynomial, and $\boldsymbol{V} \in \mathbbm{R}^{d \times r}$ with $r \ll d$, is the matrix containing relevant model directions. We assume that $\boldsymbol{V}$ satisfies a certain $\ell_q$-sparsity condition for matrices and show that pruning neural networks proportional to the sparsity level of $\boldsymbol{V}$ improves their sample complexity compared to unpruned networks. Furthermore, we establish Correlational Statistical Query (CSQ) lower bounds in this setting, which take the sparsity level of $\boldsymbol{V}$ into account. We show that if the sparsity level of $\boldsymbol{V}$ exceeds a certain threshold, training pruned networks with a gradient descent algorithm achieves the sample complexity suggested by the CSQ lower bound. In the same scenario, however, our results imply that basis-independent methods such as models trained via standard gradient descent initialized with rotationally invariant random weights can provably achieve only suboptimal sample complexity. △ Less

Submitted 12 June, 2024; originally announced June 2024.

Comments: Accepted for presentation at the Conference on Learning Theory (COLT) 2024

arXiv:2202.11632 [pdf, other]

Mirror Descent Strikes Again: Optimal Stochastic Convex Optimization under Infinite Noise Variance

Authors: Nuri Mert Vural, Lu Yu, Krishnakumar Balasubramanian, Stanislav Volgushev, Murat A. Erdogdu

Abstract: We study stochastic convex optimization under infinite noise variance. Specifically, when the stochastic gradient is unbiased and has uniformly bounded $(1+κ)$-th moment, for some $κ\in (0,1]$, we quantify the convergence rate of the Stochastic Mirror Descent algorithm with a particular class of uniformly convex mirror maps, in terms of the number of iterations, dimensionality and related geometri… ▽ More We study stochastic convex optimization under infinite noise variance. Specifically, when the stochastic gradient is unbiased and has uniformly bounded $(1+κ)$-th moment, for some $κ\in (0,1]$, we quantify the convergence rate of the Stochastic Mirror Descent algorithm with a particular class of uniformly convex mirror maps, in terms of the number of iterations, dimensionality and related geometric parameters of the optimization problem. Interestingly this algorithm does not require any explicit gradient clip** or normalization, which have been extensively used in several recent empirical and theoretical works. We complement our convergence results with information-theoretic lower bounds showing that no other algorithm using only stochastic first-order oracles can achieve improved rates. Our results have several interesting consequences for devising online/streaming stochastic approximation algorithms for problems arising in robust statistics and machine learning. △ Less

Submitted 23 February, 2022; originally announced February 2022.

Comments: 31 pages, 1 figure

arXiv:2005.08948 [pdf, other]

Achieving Online Regression Performance of LSTMs with Simple RNNs

Authors: N. Mert Vural, Fatih Ilhan, Selim F. Yilmaz, Salih Ergüt, Suleyman S. Kozat

Abstract: Recurrent Neural Networks (RNNs) are widely used for online regression due to their ability to generalize nonlinear temporal dependencies. As an RNN model, Long-Short-Term-Memory Networks (LSTMs) are commonly preferred in practice, as these networks are capable of learning long-term dependencies while avoiding the vanishing gradient problem. However, due to their large number of parameters, traini… ▽ More Recurrent Neural Networks (RNNs) are widely used for online regression due to their ability to generalize nonlinear temporal dependencies. As an RNN model, Long-Short-Term-Memory Networks (LSTMs) are commonly preferred in practice, as these networks are capable of learning long-term dependencies while avoiding the vanishing gradient problem. However, due to their large number of parameters, training LSTMs requires considerably longer training time compared to simple RNNs (SRNNs). In this paper, we achieve the online regression performance of LSTMs with SRNNs efficiently. To this end, we introduce a first-order training algorithm with a linear time complexity in the number of parameters. We show that when SRNNs are trained with our algorithm, they provide very similar regression performance with the LSTMs in two to three times shorter training time. We provide strong theoretical analysis to support our experimental results by providing regret bounds on the convergence rate of our algorithm. Through an extensive set of experiments, we verify our theoretical work and demonstrate significant performance improvements of our algorithm with respect to LSTMs and the other state-of-the-art learning models. △ Less

Submitted 31 May, 2021; v1 submitted 16 May, 2020; originally announced May 2020.

Comments: arXiv admin note: substantial text overlap with arXiv:2003.03601

arXiv:2003.03601

RNN-based Online Learning: An Efficient First-Order Optimization Algorithm with a Convergence Guarantee

Authors: N. Mert Vural, Selim F. Yilmaz, Fatih Ilhan, Suleyman S. Kozat

Abstract: We investigate online nonlinear regression with continually running recurrent neural network networks (RNNs), i.e., RNN-based online learning. For RNN-based online learning, we introduce an efficient first-order training algorithm that theoretically guarantees to converge to the optimum network parameters. Our algorithm is truly online such that it does not make any assumption on the learning envi… ▽ More We investigate online nonlinear regression with continually running recurrent neural network networks (RNNs), i.e., RNN-based online learning. For RNN-based online learning, we introduce an efficient first-order training algorithm that theoretically guarantees to converge to the optimum network parameters. Our algorithm is truly online such that it does not make any assumption on the learning environment to guarantee convergence. Through numerical simulations, we verify our theoretical results and illustrate significant performance improvements achieved by our algorithm with respect to the state-of-the-art RNN training methods. △ Less

Submitted 31 May, 2021; v1 submitted 7 March, 2020; originally announced March 2020.

Comments: This paper was an early draft of the presented results. We have written and published another paper (arXiv:2005.08948) where we have improved the material in this paper. The published paper covers most of the material presented in this paper as well. Therefore, we remove this paper from Arxiv and kindly refer the interested readers to arXiv:2005.08948

arXiv:1911.12258

Stability of the Decoupled Extended Kalman Filter Learning Algorithm in LSTM-Based Online Learning

Authors: Nuri Mert Vural, Fatih Ilhan, Suleyman S. Kozat

Abstract: We investigate the convergence and stability properties of the decoupled extended Kalman filter learning algorithm (DEKF) within the long-short term memory network (LSTM) based online learning framework. For this purpose, we model DEKF as a perturbed extended Kalman filter and derive sufficient conditions for its stability during LSTM training. We show that if the perturbations -- introduced due t… ▽ More We investigate the convergence and stability properties of the decoupled extended Kalman filter learning algorithm (DEKF) within the long-short term memory network (LSTM) based online learning framework. For this purpose, we model DEKF as a perturbed extended Kalman filter and derive sufficient conditions for its stability during LSTM training. We show that if the perturbations -- introduced due to decoupling -- stay bounded, DEKF learns LSTM parameters with similar convergence and stability properties of the global extended Kalman filter learning algorithm. We verify our results with several numerical simulations and compare DEKF with other LSTM training methods. In our simulations, we also observe that the well-known hyper-parameter selection approaches used for DEKF in the literature satisfy our conditions. △ Less

Submitted 31 May, 2021; v1 submitted 25 November, 2019; originally announced November 2019.

Comments: This paper was an early draft of the presented results. We have written and published another paper (arXiv:1911.12258) where we have improved on the material in this paper. The published paper covers most of the material presented in this paper as well. Therefore, we remove this paper from Arxiv and refer the interested readers to arXiv:1911.12258

arXiv:1911.11122 [pdf, other]

Minimax Optimal Algorithms for Adversarial Bandit Problem with Multiple Plays

Authors: N. Mert Vural, Hakan Gokcesu, Kaan Gokcesu, Suleyman S. Kozat

Abstract: We investigate the adversarial bandit problem with multiple plays under semi-bandit feedback. We introduce a highly efficient algorithm that asymptotically achieves the performance of the best switching $m$-arm strategy with minimax optimal regret bounds. To construct our algorithm, we introduce a new expert advice algorithm for the multiple-play setting. By using our expert advice algorithm, we a… ▽ More We investigate the adversarial bandit problem with multiple plays under semi-bandit feedback. We introduce a highly efficient algorithm that asymptotically achieves the performance of the best switching $m$-arm strategy with minimax optimal regret bounds. To construct our algorithm, we introduce a new expert advice algorithm for the multiple-play setting. By using our expert advice algorithm, we additionally improve the best-known high-probability bound for the multi-play setting by $O(\sqrt{m})$. Our results are guaranteed to hold in an individual sequence manner since we have no statistical assumption on the bandit arm gains. Through an extensive set of experiments involving synthetic and real data, we demonstrate significant performance gains achieved by the proposed algorithm with respect to the state-of-the-art algorithms. △ Less

Submitted 25 November, 2019; originally announced November 2019.

arXiv:1910.09857 [pdf, other]

An Efficient and Effective Second-Order Training Algorithm for LSTM-based Adaptive Learning

Authors: N. Mert Vural, Salih Ergüt, Suleyman S. Kozat

Abstract: We study adaptive (or online) nonlinear regression with Long-Short-Term-Memory (LSTM) based networks, i.e., LSTM-based adaptive learning. In this context, we introduce an efficient Extended Kalman filter (EKF) based second-order training algorithm. Our algorithm is truly online, i.e., it does not assume any underlying data generating process and future information, except that the target sequence… ▽ More We study adaptive (or online) nonlinear regression with Long-Short-Term-Memory (LSTM) based networks, i.e., LSTM-based adaptive learning. In this context, we introduce an efficient Extended Kalman filter (EKF) based second-order training algorithm. Our algorithm is truly online, i.e., it does not assume any underlying data generating process and future information, except that the target sequence is bounded. Through an extensive set of experiments, we demonstrate significant performance gains achieved by our algorithm with respect to the state-of-the-art methods. Here, we mainly show that our algorithm consistently provides 10 to 45\% improvement in the accuracy compared to the widely-used adaptive methods Adam, RMSprop, and DEKF, and comparable performance to EKF with a 10 to 15 times reduction in the run-time. △ Less

Submitted 31 May, 2021; v1 submitted 22 October, 2019; originally announced October 2019.

Showing 1–7 of 7 results for author: Vural, N M