-
Does SGD really happen in tiny subspaces?
Authors:
Minhak Song,
Kwangjun Ahn,
Chulhee Yun
Abstract:
Understanding the training dynamics of deep neural networks is challenging due to their high-dimensional nature and intricate loss landscapes. Recent studies have revealed that, along the training trajectory, the gradient approximately aligns with a low-rank top eigenspace of the training loss Hessian, referred to as the dominant subspace. Given this alignment, this paper explores whether neural n…
▽ More
Understanding the training dynamics of deep neural networks is challenging due to their high-dimensional nature and intricate loss landscapes. Recent studies have revealed that, along the training trajectory, the gradient approximately aligns with a low-rank top eigenspace of the training loss Hessian, referred to as the dominant subspace. Given this alignment, this paper explores whether neural networks can be trained within the dominant subspace, which, if feasible, could lead to more efficient training methods. Our primary observation is that when the SGD update is projected onto the dominant subspace, the training loss does not decrease further. This suggests that the observed alignment between the gradient and the dominant subspace is spurious. Surprisingly, projecting out the dominant subspace proves to be just as effective as the original update, despite removing the majority of the original update component. Similar observations are made for the large learning rate regime (also known as Edge of Stability) and Sharpness-Aware Minimization. We discuss the main causes and implications of this spurious alignment, shedding light on the intricate dynamics of neural network training.
△ Less
Submitted 24 May, 2024;
originally announced May 2024.
-
The Crucial Role of Normalization in Sharpness-Aware Minimization
Authors:
Yan Dai,
Kwangjun Ahn,
Suvrit Sra
Abstract:
Sharpness-Aware Minimization (SAM) is a recently proposed gradient-based optimizer (Foret et al., ICLR 2021) that greatly improves the prediction performance of deep neural networks. Consequently, there has been a surge of interest in explaining its empirical success. We focus, in particular, on understanding the role played by normalization, a key component of the SAM updates. We theoretically an…
▽ More
Sharpness-Aware Minimization (SAM) is a recently proposed gradient-based optimizer (Foret et al., ICLR 2021) that greatly improves the prediction performance of deep neural networks. Consequently, there has been a surge of interest in explaining its empirical success. We focus, in particular, on understanding the role played by normalization, a key component of the SAM updates. We theoretically and empirically study the effect of normalization in SAM for both convex and non-convex functions, revealing two key roles played by normalization: i) it helps in stabilizing the algorithm; and ii) it enables the algorithm to drift along a continuum (manifold) of minima -- a property identified by recent theoretical works that is the key to better performance. We further argue that these two properties of normalization make SAM robust against the choice of hyper-parameters, supporting the practicality of SAM. Our conclusions are backed by various experiments.
△ Less
Submitted 23 October, 2023; v1 submitted 24 May, 2023;
originally announced May 2023.
-
Probabilistic Forecast-based Portfolio Optimization of Electricity Demand at Low Aggregation Levels
Authors:
Jungyeon Park,
Estêvão Alvarenga,
Jooyoung Jeon,
Ran Li,
Fotios Petropoulos,
Hokyun Kim,
Kwangwon Ahn
Abstract:
In the effort to achieve carbon neutrality through a decentralized electricity market, accurate short-term load forecasting at low aggregation levels has become increasingly crucial for various market participants' strategies. Accurate probabilistic forecasts at low aggregation levels can improve peer-to-peer energy sharing, demand response, and the operation of reliable distribution networks. How…
▽ More
In the effort to achieve carbon neutrality through a decentralized electricity market, accurate short-term load forecasting at low aggregation levels has become increasingly crucial for various market participants' strategies. Accurate probabilistic forecasts at low aggregation levels can improve peer-to-peer energy sharing, demand response, and the operation of reliable distribution networks. However, these applications require not only probabilistic demand forecasts, which involve quantification of the forecast uncertainty, but also determining which consumers to include in the aggregation to meet electricity supply at the forecast lead time. While research papers have been proposed on the supply side, no similar research has been conducted on the demand side. This paper presents a method for creating a portfolio that optimally aggregates demand for a given energy demand, minimizing forecast inaccuracy of overall low-level aggregation. Using probabilistic load forecasts produced by either ARMA-GARCH models or kernel density estimation (KDE), we propose three approaches to creating a portfolio of residential households' demand: Forecast Validated, Seasonal Residual, and Seasonal Similarity. An evaluation of probabilistic load forecasts demonstrates that all three approaches enhance the accuracy of forecasts produced by random portfolios, with the Seasonal Residual approach for Korea and Ireland outperforming the others in terms of both accuracy and computational efficiency.
△ Less
Submitted 18 April, 2023;
originally announced May 2023.
-
One-Pass Learning via Bridging Orthogonal Gradient Descent and Recursive Least-Squares
Authors:
Youngjae Min,
Kwangjun Ahn,
Navid Azizan
Abstract:
While deep neural networks are capable of achieving state-of-the-art performance in various domains, their training typically requires iterating for many passes over the dataset. However, due to computational and memory constraints and potential privacy concerns, storing and accessing all the data is impractical in many real-world scenarios where the data arrives in a stream. In this paper, we inv…
▽ More
While deep neural networks are capable of achieving state-of-the-art performance in various domains, their training typically requires iterating for many passes over the dataset. However, due to computational and memory constraints and potential privacy concerns, storing and accessing all the data is impractical in many real-world scenarios where the data arrives in a stream. In this paper, we investigate the problem of one-pass learning, in which a model is trained on sequentially arriving data without retraining on previous datapoints. Motivated by the increasing use of overparameterized models, we develop Orthogonal Recursive Fitting (ORFit), an algorithm for one-pass learning which seeks to perfectly fit every new datapoint while changing the parameters in a direction that causes the least change to the predictions on previous datapoints. By doing so, we bridge two seemingly distinct algorithms in adaptive filtering and machine learning, namely the recursive least-squares (RLS) algorithm and orthogonal gradient descent (OGD). Our algorithm uses the memory efficiently by exploiting the structure of the streaming data via an incremental principal component analysis (IPCA). Further, we show that, for overparameterized linear models, the parameter vector obtained by our algorithm is what stochastic gradient descent (SGD) would converge to in the standard multi-pass setting. Finally, we generalize the results to the nonlinear setting for highly overparameterized models, relevant for deep learning. Our experiments show the effectiveness of the proposed method compared to the baselines.
△ Less
Submitted 27 July, 2022;
originally announced July 2022.
-
Reproducibility in Optimization: Theoretical Framework and Limits
Authors:
Kwangjun Ahn,
Prateek Jain,
Ziwei Ji,
Satyen Kale,
Praneeth Netrapalli,
Gil I. Shamir
Abstract:
We initiate a formal study of reproducibility in optimization. We define a quantitative measure of reproducibility of optimization procedures in the face of noisy or error-prone operations such as inexact or stochastic gradient computations or inexact initialization. We then analyze several convex optimization settings of interest such as smooth, non-smooth, and strongly-convex objective functions…
▽ More
We initiate a formal study of reproducibility in optimization. We define a quantitative measure of reproducibility of optimization procedures in the face of noisy or error-prone operations such as inexact or stochastic gradient computations or inexact initialization. We then analyze several convex optimization settings of interest such as smooth, non-smooth, and strongly-convex objective functions and establish tight bounds on the limits of reproducibility in each setting. Our analysis reveals a fundamental trade-off between computation and reproducibility: more computation is necessary (and sufficient) for better reproducibility.
△ Less
Submitted 4 December, 2022; v1 submitted 9 February, 2022;
originally announced February 2022.
-
Agnostic Learnability of Halfspaces via Logistic Loss
Authors:
Ziwei Ji,
Kwangjun Ahn,
Pranjal Awasthi,
Satyen Kale,
Stefani Karp
Abstract:
We investigate approximation guarantees provided by logistic regression for the fundamental problem of agnostic learning of homogeneous halfspaces. Previously, for a certain broad class of "well-behaved" distributions on the examples, Diakonikolas et al. (2020) proved an $\tildeΩ(\textrm{OPT})$ lower bound, while Frei et al. (2021) proved an $\tilde{O}(\sqrt{\textrm{OPT}})$ upper bound, where…
▽ More
We investigate approximation guarantees provided by logistic regression for the fundamental problem of agnostic learning of homogeneous halfspaces. Previously, for a certain broad class of "well-behaved" distributions on the examples, Diakonikolas et al. (2020) proved an $\tildeΩ(\textrm{OPT})$ lower bound, while Frei et al. (2021) proved an $\tilde{O}(\sqrt{\textrm{OPT}})$ upper bound, where $\textrm{OPT}$ denotes the best zero-one/misclassification risk of a homogeneous halfspace. In this paper, we close this gap by constructing a well-behaved distribution such that the global minimizer of the logistic risk over this distribution only achieves $Ω(\sqrt{\textrm{OPT}})$ misclassification risk, matching the upper bound in (Frei et al., 2021). On the other hand, we also show that if we impose a radial-Lipschitzness condition in addition to well-behaved-ness on the distribution, logistic regression on a ball of bounded radius reaches $\tilde{O}(\textrm{OPT})$ misclassification risk. Our techniques also show for any well-behaved distribution, regardless of radial Lipschitzness, we can overcome the $Ω(\sqrt{\textrm{OPT}})$ lower bound for logistic loss simply at the cost of one additional convex optimization step involving the hinge loss and attain $\tilde{O}(\textrm{OPT})$ misclassification risk. This two-step convex optimization algorithm is simpler than previous methods obtaining this guarantee, all of which require solving $O(\log(1/\textrm{OPT}))$ minimization problems.
△ Less
Submitted 31 January, 2022;
originally announced January 2022.
-
Riemannian Perspective on Matrix Factorization
Authors:
Kwangjun Ahn,
Felipe Suarez
Abstract:
We study the non-convex matrix factorization approach to matrix completion via Riemannian geometry. Based on an optimization formulation over a Grassmannian manifold, we characterize the landscape based on the notion of principal angles between subspaces. For the fully observed case, our results show that there is a region in which the cost is geodesically convex, and outside of which all critical…
▽ More
We study the non-convex matrix factorization approach to matrix completion via Riemannian geometry. Based on an optimization formulation over a Grassmannian manifold, we characterize the landscape based on the notion of principal angles between subspaces. For the fully observed case, our results show that there is a region in which the cost is geodesically convex, and outside of which all critical points are strictly saddle. We empirically study the partially observed case based on our findings.
△ Less
Submitted 1 February, 2021;
originally announced February 2021.
-
SGD with shuffling: optimal rates without component convexity and large epoch requirements
Authors:
Kwangjun Ahn,
Chulhee Yun,
Suvrit Sra
Abstract:
We study without-replacement SGD for solving finite-sum optimization problems. Specifically, depending on how the indices of the finite-sum are shuffled, we consider the RandomShuffle (shuffle at the beginning of each epoch) and SingleShuffle (shuffle only once) algorithms. First, we establish minimax optimal convergence rates of these algorithms up to poly-log factors. Notably, our analysis is ge…
▽ More
We study without-replacement SGD for solving finite-sum optimization problems. Specifically, depending on how the indices of the finite-sum are shuffled, we consider the RandomShuffle (shuffle at the beginning of each epoch) and SingleShuffle (shuffle only once) algorithms. First, we establish minimax optimal convergence rates of these algorithms up to poly-log factors. Notably, our analysis is general enough to cover gradient dominated nonconvex costs, and does not rely on the convexity of individual component functions unlike existing optimal convergence results. Secondly, assuming convexity of the individual components, we further sharpen the tight convergence results for RandomShuffle by removing the drawbacks common to all prior arts: large number of epochs required for the results to hold, and extra poly-log factor gaps to the lower bound.
△ Less
Submitted 21 June, 2020; v1 submitted 12 June, 2020;
originally announced June 2020.
-
From Nesterov's Estimate Sequence to Riemannian Acceleration
Authors:
Kwangjun Ahn,
Suvrit Sra
Abstract:
We propose the first global accelerated gradient method for Riemannian manifolds. Toward establishing our result we revisit Nesterov's estimate sequence technique and develop an alternative analysis for it that may also be of independent interest. Then, we extend this analysis to the Riemannian setting, localizing the key difficulty due to non-Euclidean structure into a certain ``metric distortion…
▽ More
We propose the first global accelerated gradient method for Riemannian manifolds. Toward establishing our result we revisit Nesterov's estimate sequence technique and develop an alternative analysis for it that may also be of independent interest. Then, we extend this analysis to the Riemannian setting, localizing the key difficulty due to non-Euclidean structure into a certain ``metric distortion.'' We control this distortion by develo** a novel geometric inequality, which permits us to propose and analyze a Riemannian counterpart to Nesterov's accelerated gradient method.
△ Less
Submitted 23 January, 2020;
originally announced January 2020.
-
Regression Models Using Shapes of Functions as Predictors
Authors:
Kyungmin Ahn,
J. Derek Tucker,
Wei Wu,
Anuj Srivastava
Abstract:
Functional variables are often used as predictors in regression problems. A commonly-used parametric approach, called {\it scalar-on-function regression}, uses the $\ltwo$ inner product to map functional predictors into scalar responses. This method can perform poorly when predictor functions contain undesired phase variability, causing phases to have disproportionately large influence on the resp…
▽ More
Functional variables are often used as predictors in regression problems. A commonly-used parametric approach, called {\it scalar-on-function regression}, uses the $\ltwo$ inner product to map functional predictors into scalar responses. This method can perform poorly when predictor functions contain undesired phase variability, causing phases to have disproportionately large influence on the response variable. One past solution has been to perform phase-amplitude separation (as a pre-processing step) and then use only the amplitudes in the regression model. Here we propose a more integrated approach, termed elastic functional regression model (EFRM), where phase-separation is performed inside the regression model, rather than as a pre-processing step. This approach generalizes the notion of phase in functional data, and is based on the norm-preserving time war** of predictors. Due to its invariance properties, this representation provides robustness to predictor phase variability and results in improved predictions of the response variable over traditional models. We demonstrate this framework using a number of datasets involving gait signals, NMR data, and stock market prices.
△ Less
Submitted 25 May, 2020; v1 submitted 5 September, 2019;
originally announced September 2019.
-
Hypergraph Spectral Clustering in the Weighted Stochastic Block Model
Authors:
Kwangjun Ahn,
Kangwook Lee,
Changho Suh
Abstract:
Spectral clustering is a celebrated algorithm that partitions objects based on pairwise similarity information. While this approach has been successfully applied to a variety of domains, it comes with limitations. The reason is that there are many other applications in which only \emph{multi}-way similarity measures are available. This motivates us to explore the multi-way measurement setting. In…
▽ More
Spectral clustering is a celebrated algorithm that partitions objects based on pairwise similarity information. While this approach has been successfully applied to a variety of domains, it comes with limitations. The reason is that there are many other applications in which only \emph{multi}-way similarity measures are available. This motivates us to explore the multi-way measurement setting. In this work, we develop two algorithms intended for such setting: Hypergraph Spectral Clustering (HSC) and Hypergraph Spectral Clustering with Local Refinement (HSCLR). Our main contribution lies in performance analysis of the poly-time algorithms under a random hypergraph model, which we name the weighted stochastic block model, in which objects and multi-way measures are modeled as nodes and weights of hyperedges, respectively. Denoting by $n$ the number of nodes, our analysis reveals the following: (1) HSC outputs a partition which is better than a random guess if the sum of edge weights (to be explained later) is $Ω(n)$; (2) HSC outputs a partition which coincides with the hidden partition except for a vanishing fraction of nodes if the sum of edge weights is $ω(n)$; and (3) HSCLR exactly recovers the hidden partition if the sum of edge weights is on the order of $n \log n$. Our results improve upon the state of the arts recently established under the model and they firstly settle the order-wise optimal results for the binary edge weight case. Moreover, we show that our results lead to efficient sketching algorithms for subspace clustering, a computer vision application. Lastly, we show that HSCLR achieves the information-theoretic limits for a special yet practically relevant model, thereby showing no computational barrier for the case.
△ Less
Submitted 23 May, 2018;
originally announced May 2018.
-
Community Recovery in Hypergraphs
Authors:
Kwangjun Ahn,
Kangwook Lee,
Changho Suh
Abstract:
Community recovery is a central problem that arises in a wide variety of applications such as network clustering, motion segmentation, face clustering and protein complex detection. The objective of the problem is to cluster data points into distinct communities based on a set of measurements, each of which is associated with the values of a certain number of data points. While most of the prior w…
▽ More
Community recovery is a central problem that arises in a wide variety of applications such as network clustering, motion segmentation, face clustering and protein complex detection. The objective of the problem is to cluster data points into distinct communities based on a set of measurements, each of which is associated with the values of a certain number of data points. While most of the prior works focus on a setting in which the number of data points involved in a measurement is two, this work explores a generalized setting in which the number can be more than two. Motivated by applications particularly in machine learning and channel coding, we consider two types of measurements: (1) homogeneity measurement which indicates whether or not the associated data points belong to the same community; (2) parity measurement which denotes the modulo-2 sum of the values of the data points. Such measurements are possibly corrupted by Bernoulli noise. We characterize the fundamental limits on the number of measurements required to reconstruct the communities for the considered models.
△ Less
Submitted 11 September, 2017;
originally announced September 2017.