-
A General Family of Stochastic Proximal Gradient Methods for Deep Learning
Authors:
Jihun Yun,
Aurelie C. Lozano,
Eunho Yang
Abstract:
We study the training of regularized neural networks where the regularizer can be non-smooth and non-convex. We propose a unified framework for stochastic proximal gradient descent, which we term ProxGen, that allows for arbitrary positive preconditioners and lower semi-continuous regularizers. Our framework encompasses standard stochastic proximal gradient methods without preconditioners as special cases, which have been extensively studied in various settings. Not only that, we present two important update rules beyond the well-known standard methods as a byproduct of our approach: (i) the first closed-form proximal mappings of $\ell_q$ regularization ($0 \leq q \leq 1$) for adaptive stochastic gradient methods, and (ii) a revised version of ProxQuant that fixes a caveat of the original approach for quantization-specific regularizers. We analyze the convergence of ProxGen and show that the whole family of ProxGen enjoys the same convergence rate as stochastic proximal gradient descent without preconditioners. We also empirically show the superiority of proximal methods compared to subgradient-based approaches via extensive experiments. Interestingly, our results indicate that proximal methods with non-convex regularizers are more effective than those with convex regularizers.
Submitted 15 July, 2020;
originally announced July 2020.
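As a rough illustration of the kind of update the ProxGen abstract describes, the sketch below applies one preconditioned proximal step for the familiar convex $\ell_1$ regularizer. It is not the paper's $\ell_q$ mapping or its algorithm: the function names, the diagonal preconditioner `precond`, and the values used are all assumptions made for this example. With a diagonal preconditioner, the proximal map of the $\ell_1$ penalty separates per coordinate into soft-thresholding with threshold `lr * lam / c_i`.

```python
def soft_threshold(x, tau):
    """Proximal map of tau * |x| for a scalar x."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def preconditioned_prox_step(theta, grad, precond, lr, lam):
    """One proximal step with a positive diagonal preconditioner (hypothetical helper).

    Solves, per coordinate i,
        min_t  g_i * t + c_i / (2 * lr) * (t - theta_i)**2 + lam * |t|
    whose closed form is soft_threshold(theta_i - lr * g_i / c_i, lr * lam / c_i).
    """
    out = []
    for t, g, c in zip(theta, grad, precond):
        out.append(soft_threshold(t - lr * g / c, lr * lam / c))
    return out


# Illustrative values only; precond plays the role of an Adam-style
# per-coordinate scaling and must be strictly positive.
theta = [1.0, -0.02, 0.3]
grad = [0.5, 0.1, -0.2]
precond = [1.0, 2.0, 4.0]
new_theta = preconditioned_prox_step(theta, grad, precond, lr=0.1, lam=1.0)
```

In this example the second coordinate is set exactly to zero, which is the behavior that separates a proximal update from a subgradient step on the same penalty.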
-
A Revision of Neural Tangent Kernel-based Approaches for Neural Networks
Authors:
Kyung-Su Kim,
Aurélie C. Lozano,
Eunho Yang
Abstract:
Recent theoretical works based on the neural tangent kernel (NTK) have shed light on the optimization and generalization of over-parameterized networks, and partially bridge the gap between their practical success and classical learning theory. In particular, using the NTK-based approach, the following three representative results were obtained: (1) A training error bound was derived to show that networks can fit any finite training sample perfectly by reflecting a tighter characterization of training speed depending on the data complexity. (2) A generalization error bound invariant of network size was derived by using a data-dependent complexity measure (CMD). It follows from this CMD bound that networks can generalize arbitrary smooth functions. (3) A simple and analytic kernel function was derived as indeed equivalent to a fully-trained network. This kernel outperforms its corresponding network and the existing gold standard, Random Forests, in few-shot learning. For all of these results to hold, the network scaling factor $κ$ should decrease w.r.t. sample size n. In this case of decreasing $κ$, however, we prove that the aforementioned results are surprisingly erroneous. This is because the output value of the trained network decreases to zero when $κ$ decreases w.r.t. n. To solve this problem, we tighten key bounds by essentially removing $κ$-affected values. Our tighter analysis resolves the scaling problem and enables the validation of the original NTK-based results.
Submitted 6 August, 2020; v1 submitted 2 July, 2020;
originally announced July 2020.
-
Stochastic Gradient Methods with Block Diagonal Matrix Adaptation
Authors:
Jihun Yun,
Aurelie C. Lozano,
Eunho Yang
Abstract:
Adaptive gradient approaches that automatically adjust the learning rate on a per-feature basis have been very popular for training deep networks. This rich class of algorithms includes Adagrad, RMSprop, Adam, and recent extensions. All these algorithms have adopted diagonal matrix adaptation, due to the prohibitive computational burden of manipulating full matrices in high dimensions. In this paper, we show that block-diagonal matrix adaptation can be a practical and powerful solution that can effectively utilize structural characteristics of deep learning architectures, and significantly improve convergence and out-of-sample generalization. We present a general framework with block-diagonal matrix updates via coordinate grouping, which includes counterparts of the aforementioned algorithms, prove their convergence in non-convex optimization, highlighting benefits compared to diagonal versions. In addition, we propose an efficient spectrum-clipping scheme that benefits from superior generalization performance of SGD. Extensive experiments reveal that block-diagonal approaches achieve state-of-the-art results on several deep learning tasks, and can outperform adaptive diagonal methods, vanilla SGD, as well as a modified version of full-matrix adaptation proposed very recently.
Submitted 26 May, 2019;
originally announced May 2019.
-
Removing Clouds and Recovering Ground Observations in Satellite Image Sequences via Temporally Contiguous Robust Matrix Completion
Authors:
Jialei Wang,
Peder A. Olsen,
Andrew R. Conn,
Aurelie C. Lozano
Abstract:
We consider the problem of removing and replacing clouds in satellite image sequences, which has a wide range of applications in remote sensing. Our approach first detects and removes the cloud-contaminated part of the image sequences. It then recovers the missing scenes from the clean parts using the proposed "TECROMAC" (TEmporally Contiguous RObust MAtrix Completion) objective. The objective function balances temporal smoothness with a low-rank solution while staying close to the original observations. The matrix, whose rows are pixels and whose columns are days corresponding to the image, has low rank because the pixels reflect land types such as vegetation, roads, and lakes, and there are relatively few variations as a result. We provide efficient optimization algorithms for TECROMAC, so we can exploit images containing millions of pixels. Empirical results on real satellite image sequences, as well as simulated data, demonstrate that our approach is able to recover underlying images from heavily cloud-contaminated observations.
Submitted 13 April, 2016;
originally announced April 2016.
-
Sparse Quantile Huber Regression for Efficient and Robust Estimation
Authors:
Aleksandr Y. Aravkin,
Anju Kambadur,
Aurelie C. Lozano,
Ronny Luss
Abstract:
We consider new formulations and methods for sparse quantile regression in the high-dimensional setting. Quantile regression plays an important role in many applications, including outlier-robust exploratory analysis in gene selection. In addition, the sparsity consideration in quantile regression enables the exploration of the entire conditional distribution of the response variable given the predictors and therefore yields a more comprehensive view of the important predictors. We propose a generalized OMP algorithm for variable selection, taking the misfit loss to be either the traditional quantile loss or a smooth version we call quantile Huber, and compare the resulting greedy approaches with convex sparsity-regularized formulations. We apply a recently proposed interior point methodology to efficiently solve all convex formulations as well as convex subproblems in the generalized OMP setting, provide theoretical guarantees of consistent estimation, and demonstrate the performance of our approach using empirical studies of simulated and genomic datasets.
Submitted 19 February, 2014;
originally announced February 2014.
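To make the two misfit losses named in the abstract above concrete, the sketch below writes out the traditional quantile (pinball) loss and one standard way to smooth it. The smoothed form here is the Moreau envelope of the pinball loss with parameter `kappa`, which replaces the kink at zero with a quadratic piece. That this coincides with the paper's "quantile Huber" definition is an assumption, and the function and parameter names are chosen for this example only.

```python
def quantile_loss(r, tau):
    """Pinball loss of a residual r at quantile level tau in (0, 1)."""
    return tau * r if r >= 0 else (tau - 1.0) * r


def quantile_huber(r, tau, kappa):
    """Smoothed pinball loss (assumed form: Moreau envelope with parameter kappa > 0).

    Quadratic on [-(1 - tau) * kappa, tau * kappa], linear with the pinball
    slopes outside that interval, and continuous at both breakpoints.
    """
    if r > tau * kappa:
        return tau * r - 0.5 * kappa * tau ** 2
    if r < -(1.0 - tau) * kappa:
        return (tau - 1.0) * r - 0.5 * kappa * (1.0 - tau) ** 2
    return r * r / (2.0 * kappa)


# Illustrative residuals at the 0.9 quantile: positive residuals
# (under-prediction) are penalized nine times more steeply than negative ones.
tau, kappa = 0.9, 1.0
losses = [(r, quantile_loss(r, tau), quantile_huber(r, tau, kappa))
          for r in (-2.0, 0.5, 2.0)]
```

As `kappa` shrinks toward zero, the quadratic region narrows and the smoothed loss approaches the pinball loss.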
-
Scalable Matrix-valued Kernel Learning for High-dimensional Nonlinear Multivariate Regression and Granger Causality
Authors:
Vikas Sindhwani,
Minh Ha Quang,
Aurelie C. Lozano
Abstract:
We propose a general matrix-valued multiple kernel learning framework for high-dimensional nonlinear multivariate regression problems. This framework allows a broad class of mixed norm regularizers, including those that induce sparsity, to be imposed on a dictionary of vector-valued Reproducing Kernel Hilbert Spaces. We develop a highly scalable and eigendecomposition-free algorithm that orchestrates two inexact solvers for simultaneously learning both the input and output components of separable matrix-valued kernels. As a key application enabled by our framework, we show how high-dimensional causal inference tasks can be naturally cast as sparse function estimation problems, leading to novel nonlinear extensions of a class of Graphical Granger Causality techniques. Our algorithmic developments and extensive empirical studies are complemented by theoretical analyses in terms of Rademacher generalization bounds.
Submitted 7 March, 2013; v1 submitted 17 October, 2012;
originally announced October 2012.