Search | arXiv e-print repository

Compressible Dynamics in Deep Overparameterized Low-Rank Learning & Adaptation

Authors: Can Yaras, Peng Wang, Laura Balzano, Qing Qu

Abstract: While overparameterization in machine learning models offers great benefits in terms of optimization and generalization, it also leads to increased computational requirements as model sizes grow. In this work, we show that by leveraging the inherent low-dimensional structures of data and compressible dynamics within the model parameters, we can reap the benefits of overparameterization without the… ▽ More While overparameterization in machine learning models offers great benefits in terms of optimization and generalization, it also leads to increased computational requirements as model sizes grow. In this work, we show that by leveraging the inherent low-dimensional structures of data and compressible dynamics within the model parameters, we can reap the benefits of overparameterization without the computational burdens. In practice, we demonstrate the effectiveness of this approach for deep low-rank matrix completion as well as fine-tuning language models. Our approach is grounded in theoretical findings for deep overparameterized low-rank matrix recovery, where we show that the learning dynamics of each weight matrix are confined to an invariant low-dimensional subspace. Consequently, we can construct and train compact, highly compressed factorizations possessing the same benefits as their overparameterized counterparts. In the context of deep matrix completion, our technique substantially improves training efficiency while retaining the advantages of overparameterization. For language model fine-tuning, we propose a method called "Deep LoRA", which improves the existing low-rank adaptation (LoRA) technique, leading to reduced overfitting and a simplified hyperparameter setup, while maintaining comparable efficiency. We validate the effectiveness of Deep LoRA on natural language tasks, particularly when fine-tuning with limited data. Our code is available at https://github.com/cjyaras/deep-lora-transformers. △ Less

Submitted 9 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

Comments: Accepted at ICML'24 (Oral)

arXiv:2311.02960 [pdf, other]

Understanding Deep Representation Learning via Layerwise Feature Compression and Discrimination

Authors: Peng Wang, Xiao Li, Can Yaras, Zhihui Zhu, Laura Balzano, Wei Hu, Qing Qu

Abstract: Over the past decade, deep learning has proven to be a highly effective tool for learning meaningful features from raw data. However, it remains an open question how deep networks perform hierarchical feature learning across layers. In this work, we attempt to unveil this mystery by investigating the structures of intermediate features. Motivated by our empirical findings that linear layers mimic… ▽ More Over the past decade, deep learning has proven to be a highly effective tool for learning meaningful features from raw data. However, it remains an open question how deep networks perform hierarchical feature learning across layers. In this work, we attempt to unveil this mystery by investigating the structures of intermediate features. Motivated by our empirical findings that linear layers mimic the roles of deep layers in nonlinear networks for feature learning, we explore how deep linear networks transform input data into output by investigating the output (i.e., features) of each layer after training in the context of multi-class classification problems. Toward this goal, we first define metrics to measure within-class compression and between-class discrimination of intermediate features, respectively. Through theoretical analysis of these two metrics, we show that the evolution of features follows a simple and quantitative pattern from shallow to deep layers when the input data is nearly orthogonal and the network weights are minimum-norm, balanced, and approximate low-rank: Each layer of the linear network progressively compresses within-class features at a geometric rate and discriminates between-class features at a linear rate with respect to the number of layers that data have passed through. To the best of our knowledge, this is the first quantitative characterization of feature evolution in hierarchical representations of deep linear networks. Empirically, our extensive experiments not only validate our theoretical results numerically but also reveal a similar pattern in deep nonlinear networks which aligns well with recent empirical studies. Moreover, we demonstrate the practical implications of our results in transfer learning. Our code is available at \url{https://github.com/Heimine/PNC_DLN}. △ Less

Submitted 9 January, 2024; v1 submitted 6 November, 2023; originally announced November 2023.

Comments: 61 pages, 14 figures

arXiv:2306.01154 [pdf, other]

The Law of Parsimony in Gradient Descent for Learning Deep Linear Networks

Authors: Can Yaras, Peng Wang, Wei Hu, Zhihui Zhu, Laura Balzano, Qing Qu

Abstract: Over the past few years, an extensively studied phenomenon in training deep networks is the implicit bias of gradient descent towards parsimonious solutions. In this work, we investigate this phenomenon by narrowing our focus to deep linear networks. Through our analysis, we reveal a surprising "law of parsimony" in the learning dynamics when the data possesses low-dimensional structures. Specific… ▽ More Over the past few years, an extensively studied phenomenon in training deep networks is the implicit bias of gradient descent towards parsimonious solutions. In this work, we investigate this phenomenon by narrowing our focus to deep linear networks. Through our analysis, we reveal a surprising "law of parsimony" in the learning dynamics when the data possesses low-dimensional structures. Specifically, we show that the evolution of gradient descent starting from orthogonal initialization only affects a minimal portion of singular vector spaces across all weight matrices. In other words, the learning process happens only within a small invariant subspace of each weight matrix, despite the fact that all weight parameters are updated throughout training. This simplicity in learning dynamics could have significant implications for both efficient training and a better understanding of deep networks. First, the analysis enables us to considerably improve training efficiency by taking advantage of the low-dimensional structure in learning dynamics. We can construct smaller, equivalent deep linear networks without sacrificing the benefits associated with the wider counterparts. Second, it allows us to better understand deep representation learning by elucidating the linear progressive separation and concentration of representations from shallow to deep layers. We also conduct numerical experiments to support our theoretical results. The code for our experiments can be found at https://github.com/cjyaras/lawofparsimony. △ Less

Submitted 1 June, 2023; originally announced June 2023.

Comments: The first two authors contributed to this work equally; 32 pages, 12 figures

arXiv:2209.09211 [pdf, other]

Neural Collapse with Normalized Features: A Geometric Analysis over the Riemannian Manifold

Authors: Can Yaras, Peng Wang, Zhihui Zhu, Laura Balzano, Qing Qu

Abstract: When training overparameterized deep networks for classification tasks, it has been widely observed that the learned features exhibit a so-called "neural collapse" phenomenon. More specifically, for the output features of the penultimate layer, for each class the within-class features converge to their means, and the means of different classes exhibit a certain tight frame structure, which is also… ▽ More When training overparameterized deep networks for classification tasks, it has been widely observed that the learned features exhibit a so-called "neural collapse" phenomenon. More specifically, for the output features of the penultimate layer, for each class the within-class features converge to their means, and the means of different classes exhibit a certain tight frame structure, which is also aligned with the last layer's classifier. As feature normalization in the last layer becomes a common practice in modern representation learning, in this work we theoretically justify the neural collapse phenomenon for normalized features. Based on an unconstrained feature model, we simplify the empirical loss function in a multi-class classification task into a nonconvex optimization problem over the Riemannian manifold by constraining all features and classifiers over the sphere. In this context, we analyze the nonconvex landscape of the Riemannian optimization problem over the product of spheres, showing a benign global landscape in the sense that the only global minimizers are the neural collapse solutions while all other critical points are strict saddles with negative curvature. Experimental results on practical deep networks corroborate our theory and demonstrate that better representations can be learned faster via feature normalization. △ Less

Submitted 7 March, 2023; v1 submitted 19 September, 2022; originally announced September 2022.

Comments: The first two authors contributed to this work equally; 38 pages, 13 figures. Accepted at NeurIPS'22

arXiv:2104.14032 [pdf, other]

Randomized Histogram Matching: A Simple Augmentation for Unsupervised Domain Adaptation in Overhead Imagery

Authors: Can Yaras, Kaleb Kassaw, Bohao Huang, Kyle Bradbury, Jordan M. Malof

Abstract: Modern deep neural networks (DNNs) are highly accurate on many recognition tasks for overhead (e.g., satellite) imagery. However, visual domain shifts (e.g., statistical changes due to geography, sensor, or atmospheric conditions) remain a challenge, causing the accuracy of DNNs to degrade substantially and unpredictably when testing on new sets of imagery. In this work, we model domain shifts cau… ▽ More Modern deep neural networks (DNNs) are highly accurate on many recognition tasks for overhead (e.g., satellite) imagery. However, visual domain shifts (e.g., statistical changes due to geography, sensor, or atmospheric conditions) remain a challenge, causing the accuracy of DNNs to degrade substantially and unpredictably when testing on new sets of imagery. In this work, we model domain shifts caused by variations in imaging hardware, lighting, and other conditions as non-linear pixel-wise transformations, and we perform a systematic study indicating that modern DNNs can become largely robust to these types of transformations, if provided with appropriate training data augmentation. In general, however, we do not know the transformation between two sets of imagery. To overcome this, we propose a fast real-time unsupervised training augmentation technique, termed randomized histogram matching (RHM). We conduct experiments with two large benchmark datasets for building segmentation and find that despite its simplicity, RHM consistently yields similar or superior performance compared to state-of-the-art unsupervised domain adaptation approaches, while being significantly simpler and more computationally efficient. RHM also offers substantially better performance than other comparably simple approaches that are widely used for overhead imagery. △ Less

Submitted 11 August, 2023; v1 submitted 28 April, 2021; originally announced April 2021.

Comments: Includes a main paper (10 pages). This paper is currently undergoing peer review

Showing 1–5 of 5 results for author: Yaras, C