MCNC: Manifold Constrained Network Compression

Chayne Thrash¹, Ali Abbasi¹, Parsa Nooralinejad², Soroush Abbasi Koohpayegani², Reed Andreas¹, Hamed Pirsiavash², Soheil Kolouri¹

¹ Department of Computer Science, Vanderbilt University, Nashville, TN
² Department of Computer Science, University of California, Davis, CA

Abstract

The outstanding performance of large foundational models across diverse tasks—from computer vision to speech and natural language processing—has significantly increased their demand. However, storing and transmitting these models pose significant challenges due to their massive size (e.g., 350GB for GPT-3). Recent literature has focused on compressing the original weights or reducing the number of parameters required for fine-tuning these models. These compression methods typically involve constraining the parameter space, for example, through low-rank reparametrization (e.g., LoRA) or quantization (e.g., QLoRA) during model training. In this paper, we present MCNC as a novel model compression method that constrains the parameter space to low-dimensional pre-defined and frozen nonlinear manifolds, which effectively cover this space. Given the prevalence of good solutions in over-parameterized deep neural networks, we show that by constraining the parameter space to our proposed manifold, we can identify high-quality solutions while achieving unprecedented compression rates across a wide variety of tasks. Through extensive experiments in computer vision and natural language processing tasks, we demonstrate that our method, MCNC, significantly outperforms state-of-the-art baselines in terms of compression, accuracy, and/or model reconstruction time.

1 Introduction

Recent state-of-the-art performance in computer vision [1, 2], natural language processing [3, 4], speech [5, 6], and other fields [7, 8] is often achieved through large transformer networks [9] with hundreds of millions to tens of billions of parameters, trained on vast amounts of data. For instance, the GPT-3 model [3], with its 175 billion parameters, requires around 350GB of memory for storage when using float16 precision. As the size of these models continues to increase and the field moves towards large foundational models, the need for effective compression methods that shrink the model size without significant loss of performance is ever-increasing. Model compression has become essential for efficient storage, communication, and potential deployment on edge devices.

Similarly, fine-tuning foundation models to custom tasks and datasets has become the gold standard for obtaining high-performing, customized models. However, besides the storage concerns for such customized models, the overhead involved in loading task-specific models and transferring weights from CPU to GPU often becomes a computational bottleneck in many applications. These concerns have led to the development of Parameter Efficient Fine-Tuning (PEFT) [10] approaches, which have become very popular in recent years. These methods focus on efficiently compressing the residual weights/parameters to adapt the foundational model to specific tasks. For instance, enforcing the residual weights to be low-rank, as proposed in LoRA [11], would significantly reduce the number of trainable parameters for fine-tuning the model.

A wide variety of methods have been proposed for the compression of large models [12]. These approaches can be roughly categorized into five principal techniques: weight sharing, quantization, pruning, knowledge distillation, and reparameterization. Related to our method, reparameterization methods [11, 13, 14, 15] focus on reducing the number of parameters or simplifying computations by restructuring or reparameterizing model weights while maintaining performance. Notably, many recent papers focus on hybrid approaches that leverage multiple principal compression techniques like quantization and reparameterization, such as QLoRA [16]. In this paper, we propose a novel nonlinear reparameterization technique for compressing the weights of large models.

Our work in this paper is inspired by the recent reparameterization methods PRANC [13] and NOLA [15]. At their core, these reparametrization techniques leverage the abundance of good solutions for overparameterized neural networks [17] with $d$ parameters and constrain the search for optimal solutions to random $k$ -dimensional subspaces within the parameter space, where $k\ll d$ . Here, we pose the following question: Could we find better solutions by increasing the complexity of our search space from a random subspace [13, 15] to a k-dimensional nonlinear manifold that efficiently covers the parameter space? Below, we further clarify the main concept in our paper through an intuitive thought experiment.

Figure 1: Consider a model or a module in our model parameterized by $\theta\in\mathbb{R}^{d}$. We use polar decomposition to represent the model parameters as $\theta=\beta u$, where $\beta\in\mathbb{R}$ is the amplitude, $u\in\mathbb{S}^{d-1}$ is the unit vector denoting the direction, and $\mathbb{S}^{d-1}$ is the $d$-dimensional hypersphere. The core concept in our proposed work is to wrap a bounded $k$-dimensional subspace, $[-L,L]^{k}$, around the hypersphere using a nonlinear generator $\phi:[-L,L]^{k}\to\mathbb{S}^{d-1}$, and reparameterize the weights as $\theta=\beta\phi(\alpha)$, where $\alpha\in[-L,L]^{k}$ (Panel (a)). Additionally, we can increase our modeling flexibility by constraining $\theta$ to live in the span of $\{\phi(\alpha_{i})\}_{i=1}^{m}$, i.e., $\theta=\sum_{i=1}^{m}\beta_{i}\phi(\alpha_{i})$, as seen in Panel (b). To train the model/module, we optimize not in the $d$-dimensional space, but with respect to $\{(\alpha_{i},\beta_{i})\}_{i=1}^{m}$, reducing the number of parameters from $d$ to $m(k+1)$ (Panel (c)). Lastly, the manifold-constrained weights are obtained via $\theta^{*}_{\mathcal{M}}=\sum_{i=1}^{m}\beta^{*}_{i}\phi(\alpha^{*}_{i})$, as shown in Panel (d).

Winding a string around a sphere. Suppose we want to optimize a model with parameters $\theta\in\mathbb{R}^{d}$. Using polar decomposition, we can express $\theta$ as $\theta=\beta u$, where $\beta\in\mathbb{R}$ is the amplitude, and $u=\theta/\|\theta\|\in\mathbb{S}^{d-1}$ is a unit vector in the direction of $\theta$. Here, $\mathbb{S}^{d-1}$ denotes the $d$-dimensional hypersphere. Now, imagine taking a segment of the real line $[-L,L]$ (think of a string of length $2L$) and wrapping it around the hypersphere $\mathbb{S}^{d-1}$ to maximize coverage, thus parametrizing a one-dimensional manifold with $\alpha\in[-L,L]$ on $\mathbb{S}^{d-1}$. Instead of optimizing directly in the $d$-dimensional space, we can now optimize over the amplitude $\beta\in\mathbb{R}$ and the manifold parametrization $\alpha\in[-L,L]$. Figure 1 (a) demonstrates this concept for $d=3$. Taking this idea one step further, we can wrap a $k$-dimensional space, $[-L,L]^{k}$, around the hypersphere instead of just a one-dimensional segment. Increasing $k$ is expected to increase the coverage of the hypersphere. This effectively reduces the number of parameters from $d$ to $k+1$, meaning we now have $\alpha\in[-L,L]^{k}$ and $\beta\in\mathbb{R}$.

We first train an MLP generator network that covers the hypersphere using far fewer input dimensions. Then, we reparameterize our deep model by constructing its weight matrices from several runs of the generator network. Finally, instead of optimizing the parameters of the deep model, we optimize the low-dimensional inputs to the generator network that reconstruct the model parameters. We show that our method outperforms several SOTA baselines, including linear random subspace methods, i.e., PRANC [13] and NOLA [15], on various benchmarks.

Contributions. Our specific contributions in this work are:

1. Introducing a novel nonlinear reparameterization technique for model compression, termed Manifold Constrained Network Compression (MCNC).
2. Utilizing generative modeling techniques to explicitly parameterize a low-dimensional manifold in a high-dimensional space and constraining model training to this manifold.
3. Demonstrating the effectiveness of MCNC compared to recent network compression methods in the literature across vision and natural language processing tasks and across diverse architectures.

2 Related Work

Our goal in this paper is to train a deep model with a minimal number of parameters, making it more efficient to transfer between agents or store on devices with limited memory. This motivation is similar to that of recent reparameterization methods [13, 15]. Notably, our compact representation assumes no change in the model size, weight sharing, the number of non-zero parameters, or the precision of computation. Hence, our contributions are orthogonal to approaches like weight sharing, quantization, pruning, and knowledge distillation. Hybrid approaches could be developed by combining the proposed Manifold Constrained Network Compression (MCNC) with these techniques. Here, we provide a brief overview of these techniques, which reduce the model’s size on disk and could improve inference time, typically without significantly affecting the model’s performance.

Weight sharing can be enforced through the model architecture, such as with convolutional kernels [18], recurrent networks [19], or other hybrid architectures [20]. It can also be achieved post-training via clustering mechanisms [21, 22] or hashing [23] to bundle the weights of a model into a few shared components. Other notable weight sharing methods proposed in recent years include [24, 25, 26]. Our proposed method, MCNC, introduces a novel low-dimensional reparameterization of model weights, which can be combined with weight-sharing methods. For instance, different layers of the network could share some of their low-dimensional parameters to further improve compression.

Quantization is one of the most prevalent methods of model compression that focuses on reducing the number of bits required to represent model weights and/or activations (W/A). For instance, Rastegari et al. [27] proposed XNOR-Networks, where binary operations are used in a network with XNOR gates to approximate convolutions, leading to a reported 58 times speedup and 32 times memory savings. The core challenge with these techniques is that naive quantization of models often results in a significant performance drop. On the other hand, quantized training/fine-tuning is often formulated as a constrained optimization problem that requires specialized optimization techniques [28]. A large body of recent work focuses on effective quantization methods for large models [29, 30, 31, 16]. Importantly, in the extreme case, any quantization method would reduce float32 into binary, resulting in a $32\times$ compression (compared to a non-quantized baseline). In this paper, we aim for much larger compression rates. Nevertheless, our proposed method remains compatible with quantized training methods, where the bit-precision of our proposed low-dimensional parameters is reduced.

Pruning is a primary technique for neural network compression, as highlighted in several studies [32, 33, 34, 35, 36, 37, 38, 27, 28]. High pruning rates, resulting in substantial compression, can be achieved through methods detailed in [39, 40]. Intelligent coding strategies, such as applying Huffman coding on top of pruning and quantization, have been shown to further enhance compression efficiency [41]. Recent approaches, such as MIRACLE [42] and Weightless [43], have demonstrated significantly higher compression rates, achieving improvements of over 400 times. Our approach differs in that we compress models by reparameterizing them. Pruning could potentially increase our compression rate further if applied to the reparameterized space.

Compression by reparameterization has recently gained much attention from the community, and our technique falls under this category. In their seminal work, Hu et al. [11] introduced LoRA, which leverages a low-rank reparametrization of the model weights for fine-tuning large models. Many recent extensions of LoRA have also been proposed [44, 45, 16, 15]. More relevant to our proposed approach is PRANC [13], which constrains the model weights to a randomly generated low-dimensional subspace in the model’s parameter space. Notably, NOLA [15] leverages the idea of PRANC to push Parameter-Efficient Fine-Tuning (PEFT) with LoRA below the limit of rank one. PRANC and NOLA inherently rely on the prevalence of good solutions in the parameter space of overparameterized models [17]. Assuming that the ‘good solutions’ in the parameter space are located on a low-dimensional nonlinear manifold [46, 17], there’s a reasonable chance that a randomly drawn low-dimensional subspace within the parameter space will intersect this manifold. This intersection could enable us to find acceptable solutions within our chosen subspace. The core idea behind our paper stems from this question: Can we (nonlinearly) manipulate this k-dimensional subspace in the parameter space such that it better covers the space, thereby increasing the chance of intersection with the manifold of good solutions? Hence, our method can be thought of as a nonlinear extension of PRANC [13] and NOLA [15]. Moreover, we demonstrate that our method, MCNC, is faster than PRANC and NOLA in model reconstruction time (e.g., in mixed mini-batch settings) due to its GPU compatibility and lower FLOPs.

Manifold constrained optimization has been employed in various works to constrain model parameters to a Riemannian manifold in order to induce various geometric characteristics [47]. For instance, some methods require that the model parameters reside on or be close to the Stiefel manifold, i.e., the set of $k$ orthonormal vectors in $\mathbb{R}^{d}$ , to reduce redundant representations [48, 49, 50]. Manifold-constrained optimization [51, 52] in neural networks is gaining significant attention, as evidenced by recent developments in geometric optimization libraries for deep learning [53, 54, 55]. In this work, by learning a generative model, we obtain an explicit parameterization of a low-dimensional manifold that traverses a higher-dimensional hypersphere, and we constrain our optimization in the higher-dimensional space to this low-dimensional manifold. Thanks to this explicit parameterization of the manifold, we emphasize that our optimization process relies solely on standard (stochastic) gradient descent and not on Riemannian optimization.

3 Method

At its core, our Manifold Constrained Network Compression (MCNC) has two main ingredients: 1) mapping the uniform distribution in $[-L,L]^{k}$ to the uniform distribution on the $d$-dimensional hypersphere, where $k\ll d$, and 2) learning the deep model by constraining its parameters to be generated by the above map. This process results in a compression rate of almost $d/k$.

Notations and Preliminaries. Let $\theta\in\mathbb{R}^{d}$ denote a subset of the model parameters, which we aim to optimize. Through polar decomposition, we express $\theta$ by its amplitude $\beta=\|\theta\|$ and its direction $u=\theta/\|\theta\|$ . Additionally, we denote the $d$ -dimensional hypersphere with $\mathbb{S}^{d-1}$ . Our goal is to reparameterize $u\in\mathbb{S}^{d-1}$ using significantly fewer parameters than $d$ . To achieve this, we will employ a ‘generator’ model to explicitly parameterize a $k$ -dimensional manifold that best represents the $d$ -dimensional hypersphere.
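As a small illustration of the polar decomposition above, the following sketch splits a parameter vector into its amplitude $\beta=\|\theta\|$ and direction $u=\theta/\|\theta\|$ and then reconstructs it. The function name `polar_decompose` and the example vector are illustrative assumptions, not part of the method.

```python
import math

def polar_decompose(theta):
    """Split a parameter vector into amplitude beta = ||theta|| and unit direction u."""
    beta = math.sqrt(sum(t * t for t in theta))
    if beta == 0.0:
        raise ValueError("direction is undefined for the zero vector")
    u = [t / beta for t in theta]
    return beta, u

theta = [3.0, 0.0, 4.0]                      # hypothetical 3-dimensional parameter vector
beta, u = polar_decompose(theta)
reconstructed = [beta * ui for ui in u]      # theta = beta * u
print(beta, u)
```

Only the direction $u$ is later replaced by a generator output; the amplitude $\beta$ remains a free scalar.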

3.1 Traversing a High-Dimensional Hypersphere by a Low-Dimensional Manifold

We aim to model a $d$-dimensional hypersphere using a $k$-dimensional manifold. For clarity, let’s examine this concept in a simpler scenario: consider $d=3$, corresponding to a sphere, and $k=1$, akin to a string. This familiar setting helps illustrate our objective. Given a string of a specific length, how can we most effectively cover the surface of the sphere? The answer is intuitive: simply wrap the string around the sphere. This wrapping acts as a nonlinear operator that takes a straight line and deforms it around the sphere. Next, we will formalize this concept mathematically.
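The string-on-a-sphere picture can be made concrete with a hand-crafted curve. The sketch below is only a stand-in for the learned generator: the spiral formula, the name `spiral_on_sphere`, and the `turns` parameter are assumptions chosen for illustration, and the resulting curve does not cover the sphere uniformly (it is denser near the poles).

```python
import math

def spiral_on_sphere(alpha, L=20.0, turns=10):
    """Hand-crafted stand-in for phi when d = 3, k = 1: maps alpha in [-L, L] to S^2."""
    s = (alpha + L) / (2.0 * L)            # position along the string, in [0, 1]
    polar = math.pi * s                    # sweeps from the north pole to the south pole
    azimuth = 2.0 * math.pi * turns * s    # winds around the axis `turns` times
    return (math.sin(polar) * math.cos(azimuth),
            math.sin(polar) * math.sin(azimuth),
            math.cos(polar))

def reparameterize(alpha, beta, L=20.0, turns=10):
    """theta = beta * phi(alpha): two numbers (alpha, beta) stand in for three weights."""
    return tuple(beta * c for c in spiral_on_sphere(alpha, L, turns))

points = [spiral_on_sphere(-20.0 + 40.0 * i / 200) for i in range(201)]
norms = [math.sqrt(sum(c * c for c in p)) for p in points]
theta = reparameterize(alpha=3.5, beta=2.0)
```

Every point of the curve has unit norm, so scaling by $\beta$ reaches any vector whose direction lies on the curve.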

Let $\mathcal{U}([-L,L]^{k})$ represent the uniform distribution over the $k$-dimensional hypercube $[-L,L]^{k}$, and $\mathcal{U}(\mathbb{S}^{d-1})$ represent the uniform distribution over the $d$-dimensional hypersphere, $\mathbb{S}^{d-1}$. We are interested in developing a nonlinear mapping, $\phi$, that converts $\mathcal{U}([-L,L]^{k})$ into $\mathcal{U}(\mathbb{S}^{d-1})$. Specifically, this involves creating a function $\phi$ that takes samples from the hypercube, $\alpha\sim\mathcal{U}([-L,L]^{k})$, and maps them to the hypersphere such that $\phi(\alpha)\sim\mathcal{U}(\mathbb{S}^{d-1})$. This transformation mirrors the challenge of deep generative modeling, a subject thoroughly investigated in the literature [56, 57, 58]. We emphasize that our decision to use the uniform distribution on $\mathbb{S}^{d-1}$ as our target distribution stems from an assumption of no prior knowledge about the importance of different directions for the downstream task of optimizing a network. Should such information become available, the target distribution could be adjusted to reflect this knowledge, allowing for more precise wrapping around areas of greater importance.

While there are many options for learning a generative model $\phi$ , in this paper, we chose to use the SWGAN [58] for its simplicity. Briefly, each training iteration begins by sampling a batch from the source distribution, $\{\alpha_{i}\sim\mathcal{U}([-L,L]^{k})\}_{i=1}^{B}$ , and a batch from the target distribution $\{u_{j}\sim\mathcal{U}(\mathbb{S}^{d-1})\}_{j=1}^{B}$ , where $B=4096$ in all our experiments. We then minimize the generalized sliced Wasserstein distance [59] between $\{\phi(\alpha_{i})\}_{i=1}^{B}$ and $\{u_{j}\}_{j=1}^{B}$ with respect to $\phi$ .
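To make the training objective more tangible, here is a minimal sketch of a Monte Carlo estimate of the squared sliced Wasserstein distance between two equally sized batches. It uses plain linear projections rather than the generalized variant cited above, a much smaller batch than $B=4096$, and illustrative names (`sample_sphere`, `sliced_wasserstein_sq`); all of these are simplifying assumptions of the sketch.

```python
import math
import random

def sample_sphere(rng, d):
    """Draw one point uniformly from S^{d-1} by normalizing a standard Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

def sliced_wasserstein_sq(xs, ys, rng, n_projections=64):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance.

    xs and ys are equal-sized batches of d-dimensional points. Each random
    direction reduces the problem to 1-D, where optimal transport is a sort.
    """
    assert len(xs) == len(ys)
    d = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        direction = sample_sphere(rng, d)
        px = sorted(sum(a * b for a, b in zip(x, direction)) for x in xs)
        py = sorted(sum(a * b for a, b in zip(y, direction)) for y in ys)
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return total / n_projections

rng = random.Random(0)
d, B = 3, 256
target = [sample_sphere(rng, d) for _ in range(B)]         # u_j ~ U(S^{d-1})
other_uniform = [sample_sphere(rng, d) for _ in range(B)]  # what a good generator would emit
collapsed = [[0.0, 0.0, 1.0] for _ in range(B)]            # a generator stuck at one point

loss_good = sliced_wasserstein_sq(other_uniform, target, rng)
loss_bad = sliced_wasserstein_sq(collapsed, target, rng)
print(loss_good, loss_bad)
```

A batch that already looks uniform on the sphere yields a loss near zero, while a collapsed generator is penalized heavily, which is the signal used to update $\phi$.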

For the network $\phi$ , we selected a simple multi-layer perceptron (MLP) with $D$ hidden layers and a fixed size for the hidden layers, with $k$ inputs and $d$ outputs. Each layer of the MLP consists of a linear layer followed by a nonlinearity/activation, i.e., $x^{t}=\sigma(W^{t}x^{t-1})$ where $x^{t}$ is the model’s output at layer $t$ , $x^{0}=\alpha$ is the input, and $\sigma(\cdot)$ is an elementwise nonlinearity. Figure 2 illustrates the learned 1-dimensional manifolds traversing $\mathbb{S}^{2}$ , when $\phi$ is an MLP with 3 hidden layers of size 256, featuring LeakyReLU and Sine functions as the nonlinearity, $\sigma$ , trained via SWGAN. Throughout the rest of this paper, we will use Sinusoidal activations [60] as our default activation. Notably, the parameter $L$ (i.e., the input bound) is analogous to the maximum frequency of Fourier features [61] and controls the frequency of twists as seen in Figure 2.
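The layer rule $x^{t}=\sigma(W^{t}x^{t-1})$ with a sinusoidal $\sigma$ can be sketched in a few lines. The weights below are random and untrained, the hidden width is reduced to 32 for brevity, and the final projection onto the unit sphere is an assumption of this sketch rather than a detail stated in the text; `make_generator` and `phi` are illustrative names.

```python
import math
import random

def make_generator(rng, k, d, hidden=32, depth=3):
    """Random (untrained) weights for an MLP with `depth` hidden layers; no biases."""
    sizes = [k] + [hidden] * depth + [d]
    return [[[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
             for _ in range(n_out)]
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def phi(weights, alpha):
    """Forward pass x^t = sin(W^t x^{t-1}), then projection onto the unit sphere.

    The final normalization is an assumption of this sketch; it guarantees
    that the output lies on S^{d-1} even for an untrained network.
    """
    x = list(alpha)
    for W in weights:
        x = [math.sin(sum(w * xi for w, xi in zip(row, x))) for row in W]
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

rng = random.Random(0)
k, d, L = 4, 64, 20.0
weights = make_generator(rng, k, d)
alpha = [rng.uniform(-L, L) for _ in range(k)]   # alpha ~ U([-L, L]^k)
u = phi(weights, alpha)
```

Because the first layer applies a sine to $W^{1}\alpha$, a larger input bound $L$ lets the same weights produce faster oscillations, which matches the analogy to a maximum frequency drawn above.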

3.2 Reparameterization and Manifold Constrained Optimization

Having trained a generator model, $\phi:[-L,L]^{k}\rightarrow\mathbb{S}^{d-1}$ , we freeze it and use it to reparameterize $d$ -dimensional vectors as $\theta=\beta\phi(\alpha)$ , where $\beta\in\mathbb{R}$ and $\alpha\in[-L,L]^{k}$ , thereby reducing the number of parameters from $d$ to $k+1$ . Figure 1 (Panel (a)) illustrates this concept. Furthermore, we can enhance the model’s flexibility by using $m$ different pairs of $(\alpha,\beta)$ , constraining $\theta$ to the span of $\{\phi(\alpha_{i})\}_{i=1}^{m}$ . This will lead to a compression ratio of $\frac{d}{m(k+1)}$ (see Figure 1 Panels (b) and (d)).
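The parameter accounting above is easy to check numerically. In the sketch below, `toy_phi` is a hypothetical stand-in generator on the unit circle ($d=2$, $k=1$), and the values $d=1600$, $k=16$ are taken from the CIFAR experiments purely as an example.

```python
import math

def compression_ratio(d, k, m=1):
    """Ratio between the d generated weights and the m * (k + 1) stored scalars."""
    return d / (m * (k + 1))

def combine(phi, alphas, betas):
    """theta = sum_i beta_i * phi(alpha_i) for m = len(alphas) generator calls."""
    outputs = [phi(a) for a in alphas]
    d = len(outputs[0])
    return [sum(b * out[j] for b, out in zip(betas, outputs)) for j in range(d)]

def toy_phi(alpha):
    """Hypothetical stand-in generator mapping one scalar to the unit circle (d = 2, k = 1)."""
    return [math.cos(alpha[0]), math.sin(alpha[0])]

theta = combine(toy_phi, alphas=[[0.0], [math.pi / 2]], betas=[2.0, -3.0])
ratios = {m: compression_ratio(d=1600, k=16, m=m) for m in (1, 2, 4)}
print(theta, ratios)
```

With $k=16$ and $d=1600$, a single pair gives a ratio of about 94, and each doubling of $m$ halves it, so one frozen generator can serve several compression rates.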

Let $\mathcal{L}:\mathbb{R}^{d}\to\mathbb{R}$ denote the loss function for a specific task of interest, where $\mathcal{L}(\theta)$ is the associated loss with parameter $\theta$ . We then constrain the training of the parameter $\theta$ as:

$$\theta^{*}=\sum_{i=1}^{m}\beta^{*}_{i}\phi(\alpha^{*}_{i}),\qquad\{(\alpha^{*}_{i},\beta^{*}_{i})\}_{i=1}^{m}=\operatorname*{arg\,min}_{\{(\alpha_{i},\beta_{i})\}_{i=1}^{m}}\mathcal{L}\Big(\sum_{i=1}^{m}\beta_{i}\phi(\alpha_{i})\Big)\tag{1}$$

When $m=1$, this optimization constrains the direction of the model parameters, $\theta$, to the $k$-dimensional manifold on $\mathbb{S}^{d-1}$ that is parameterized by $\phi$. One of the main advantages of using $m>1$ pairs of $(\alpha,\beta)$ is that we can easily change the compression rate without retraining a generator. Our ablation studies show that, for a fixed compression rate, $m=1$ and $m>1$ result in similar performance.
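The constrained optimization in Eq. 1 can be illustrated on a toy problem. The sketch below is not the training setup used in the experiments: it swaps automatic differentiation and Adam for central finite differences with a simple backtracking step, uses a tiny random frozen generator, and takes a squared distance to a hypothetical target vector as the loss. All names (`make_phi`, `build_theta`, `task_loss`) are illustrative.

```python
import math
import random

def make_phi(rng, k, d, hidden=16):
    """Frozen toy generator: one sine hidden layer, output normalized onto S^{d-1}."""
    w1 = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(hidden)]
    w2 = [[rng.gauss(0.0, 1.0) for _ in range(hidden)] for _ in range(d)]

    def phi(alpha):
        h = [math.sin(sum(w * a for w, a in zip(row, alpha))) for row in w1]
        out = [sum(w * v for w, v in zip(row, h)) for row in w2]
        norm = math.sqrt(sum(o * o for o in out))
        return [o / norm for o in out]

    return phi

def build_theta(params, phi, k, m, d):
    """theta = sum_i beta_i * phi(alpha_i); params holds m blocks of (alpha_i, beta_i)."""
    theta = [0.0] * d
    for i in range(m):
        block = params[i * (k + 1):(i + 1) * (k + 1)]
        alpha, beta = block[:k], block[k]
        theta = [t + beta * u for t, u in zip(theta, phi(alpha))]
    return theta

def numerical_grad(f, x, eps=1e-5):
    """Central finite differences, standing in for automatic differentiation."""
    grad = []
    for j in range(len(x)):
        hi, lo = x[:], x[:]
        hi[j] += eps
        lo[j] -= eps
        grad.append((f(hi) - f(lo)) / (2.0 * eps))
    return grad

rng = random.Random(0)
k, d, m, L = 2, 8, 2, 20.0
phi = make_phi(rng, k, d)
target = [rng.gauss(0.0, 1.0) for _ in range(d)]   # hypothetical "good" weights

def task_loss(params):
    theta = build_theta(params, phi, k, m, d)
    return sum((t - g) ** 2 for t, g in zip(theta, target))

def clip_alphas(params):
    """Keep every alpha entry inside [-L, L]; beta entries are left unconstrained."""
    return [p if (j % (k + 1)) == k else max(-L, min(L, p))
            for j, p in enumerate(params)]

params = [rng.uniform(-1.0, 1.0) for _ in range(m * (k + 1))]
initial_loss = task_loss(params)
step = 0.1
for _ in range(200):
    grad = numerical_grad(task_loss, params)
    candidate = clip_alphas([p - step * g for p, g in zip(params, grad)])
    if task_loss(candidate) < task_loss(params):
        params = candidate        # accept only steps that lower the loss
    else:
        step *= 0.5               # otherwise shrink the step size
final_loss = task_loss(params)
print(initial_loss, final_loss)
```

Only the $m(k+1)=6$ scalars in `params` are updated; the generator weights stay fixed throughout, which is the essence of the manifold constraint.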

3.3 Learning the Deep Model

We train the generator $\phi(\cdot)$ once using uniformly sampled random data and freeze it. Then, given a deep model, we reshape its parameters into a long vector, divide that vector into chunks of size $d$, and reparameterize each chunk with $m(k+1)$ parameters using the generator $\phi(\cdot)$. If the model size is not divisible by $d$, the last chunk will contain some extra parameters, which are ignored. Finally, we train the model by optimizing ${\{(\alpha_{i},\beta_{i})\}_{i=1}^{m}}$ for all chunks using Eq. 1 to minimize the loss of the deep model on the task of interest, e.g., image classification. Backpropagation is as simple as calculating the gradient with respect to the model parameters and then using the chain rule to backpropagate through the generator $\phi(\cdot)$. Hence, automatic differentiation can be used directly to optimize the $\alpha$s and $\beta$s without the need for geometric optimization techniques.
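The chunking step can be sketched as simple bookkeeping. The helper names `chunk_layout` and `assemble` and the 10,000-weight example are hypothetical, and the sketch assumes one flat vector for the whole model; the exact grouping of weights used in the experiments may differ.

```python
import math

def chunk_layout(num_weights, d, k, m):
    """Chunk count, trainable scalars, and surplus outputs for a flat weight vector."""
    n_chunks = math.ceil(num_weights / d)
    trainable = n_chunks * m * (k + 1)       # m pairs (alpha_i, beta_i) per chunk
    surplus = n_chunks * d - num_weights     # generated values in the last chunk that are ignored
    return n_chunks, trainable, surplus

def assemble(chunks, num_weights):
    """Concatenate generated chunks and drop the surplus tail of the last one."""
    flat = [w for chunk in chunks for w in chunk]
    return flat[:num_weights]

n_chunks, trainable, surplus = chunk_layout(num_weights=10_000, d=1600, k=16, m=1)
dummy_chunks = [[float(i)] * 1600 for i in range(n_chunks)]   # placeholder generator outputs
weights = assemble(dummy_chunks, 10_000)
print(n_chunks, trainable, surplus, len(weights))
```

In this example a 10,000-weight model is covered by 7 chunks and 119 trainable scalars, with 1,200 generated values discarded from the last chunk.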

4 Experiments

To demonstrate the performance of our method, we evaluate MCNC under two different settings:

1. Training from scratch for image classification. In this setting, we initialize the weights of the network directly using MCNC and optimize the associated $\alpha$s and $\beta$s. We show our effectiveness at compressing a wide variety of ResNet architectures as well as ViT.
2. Parameter-Efficient Fine-Tuning of LLMs. In this setting, rather than initializing the weights of the network directly by MCNC, we begin with a pre-trained $\theta^{*}$ and learn the residual $\Delta\theta$ via MCNC. We conduct this experiment for LLMs, where fine-tuning of large models has become the norm.

Given that our method is orthogonal to the low-rank parameterization used in LoRA [11], we reparameterize either the original networks or their rank-constrained versions. Our experiments explore both scenarios.
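To see why combining the two reparameterizations changes the budget, the sketch below counts trainable scalars when MCNC is applied to a dense weight matrix versus to its two low-rank factors. The layer shape is hypothetical, and treating each factor as its own flat vector split into chunks of size $d$ is an assumption of this sketch.

```python
import math

def lora_factor_sizes(out_features, in_features, rank):
    """Sizes of the factors B (out x rank) and A (rank x in) of a low-rank matrix B @ A."""
    return out_features * rank, rank * in_features

def mcnc_trainable(num_values, d, k, m=1):
    """Trainable (alpha, beta) scalars needed to generate `num_values` numbers in chunks of size d."""
    return math.ceil(num_values / d) * m * (k + 1)

out_features, in_features, rank = 512, 512, 64       # hypothetical layer shape
size_b, size_a = lora_factor_sizes(out_features, in_features, rank)
dense = out_features * in_features                   # values in the full matrix
low_rank = size_a + size_b                           # values in the two factors
d, k = 1600, 16
direct = mcnc_trainable(dense, d, k)                 # MCNC on the dense weights
on_factors = mcnc_trainable(size_a, d, k) + mcnc_trainable(size_b, d, k)
print(dense, low_rank, direct, on_factors)
```

At the same $d$, $k$, and $m$, the low-rank version needs roughly a quarter of the scalars here, so a fixed budget leaves room for a larger $m$ or a less aggressive generator.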

4.1 CIFAR-10 and CIFAR-100

We compare MCNC on CIFAR-10 and CIFAR-100 [62] with pruning methods as well as PRANC [13] and NOLA [15]. We show results on the two datasets using both ResNet-20 and ResNet-56. For the pruning methods and PRANC, we compare directly against the results reported in [13]. We follow the experimental setting of PRANC [13], use the same data augmentation, and do not compress BatchNorm parameters. For MCNC and our implementation of NOLA, we train for $400$ epochs, decreasing the learning rate whenever the loss has not decreased for $5$ epochs to ensure convergence. We use the Adam [63] optimizer with an initial learning rate of $0.001$ for NOLA and $0.05$ for MCNC. For NOLA and MCNC with low-rank parameterization (denoted as w/ LoRA), we follow [15] and use a rank of $64$. We emphasize that we did not try different ranks, and the performance could be further enhanced by choosing the optimal rank for these tasks. For MCNC we let $k=16,d=1600,L=20$ when applied to ResNet-20 and $k=16,d=3200,L=20$ when applied to ResNet-56. We show results in Table 1. MCNC consistently outperforms all other methods while using a similar number of parameters. In addition, just as NOLA improves upon PRANC with the addition of LoRA, MCNC gains a significant boost as well.

Table 1: Comparison of our method with pruning methods as well as PRANC and NOLA. We include results for our method with and without LoRA applied to each of the layers. As none of the methods compress BatchNorm parameters, we remove them from all parameter counts. The number of BatchNorm parameters is 1,376 for ResNet-20 and 4,064 for ResNet-56. For both NOLA and our method with LoRA, we use a rank of 64.

Method	Arch.	Dataset	# Parameters	Acc.
Baseline	R20	CIFAR-10	$269,722$	$88.92$
DPF [33]	R20	CIFAR-10	$4,920\times 2$	$41.86$ [13]
SuRP [40]	R20	CIFAR-10	$1,026\times 2$	$54.22$ [13]
STR [64]	R20	CIFAR-10	$12,238\times 2$	$75.99$ [13]
PRANC [13]	R20	CIFAR-10	$10,000$	$81.48$ [13]
NOLA [15]	R20	CIFAR-10	$11,500$	$82.4$ [15]
Ours w/o LoRA	R20	CIFAR-10	$10,098$	$82.35\pm 0.4$
Ours w/ LoRA	R20	CIFAR-10	$10,098$	$83.76\pm 0.1$
Baseline	R56	CIFAR-10	$853,018$	$91.64$
DPF [33]	R56	CIFAR-10	$13,414\times 2$	$47.66$ [13]
SuRP [40]	R56	CIFAR-10	$10,834\times 2$	$66.65$ [13]
STR [64]	R56	CIFAR-10	$13,312\times 2$	$67.77$ [13]
PRANC [13]	R56	CIFAR-10	$5,000$	$76.87$ [13]
NOLA [15]	R56	CIFAR-10	$5,000$	$78.51\pm 0.1$
Ours w/o LoRA	R56	CIFAR-10	$5,015$	$79.16\pm 0.2$
Ours w/ LoRA	R56	CIFAR-10	$5,015$	$80.38\pm 0.3$
Baseline	R20	CIFAR-100	$275,572$	$60.84$
DPF [33]	R20	CIFAR-100	$10,770\times 2$	$12.25$ [13]
SuRP [40]	R20	CIFAR-100	$6,797\times 2$	$14.46$ [13]
STR [64]	R20	CIFAR-100	$10,673\times 2$	$13.18$ [13]
PRANC [13]	R20	CIFAR-100	$5,000$	$32.33$ [13]
NOLA [15]	R20	CIFAR-100	$5,180$	$35.62\pm 0.1$
Ours w/o LoRA	R20	CIFAR-100	$5,066$	$36.10\pm 0.6$
Ours w/ LoRA	R20	CIFAR-100	$5,066$	$39.52\pm 0.6$
Baseline	R56	CIFAR-100	$858,868$	$64.32$
DPF [33]	R56	CIFAR-100	$19,264\times 2$	$19.11$ [13]
SuRP [40]	R56	CIFAR-100	$10,919\times 2$	$14.59$ [13]
STR [64]	R56	CIFAR-100	$18,881\times 2$	$25.98$ [13]
PRANC [13]	R56	CIFAR-100	$5,000$	$32.97$ [13]
NOLA [15]	R56	CIFAR-100	$5,000$	$36.21\pm 0.6$
Ours w/o LoRA	R56	CIFAR-100	$5,049$	$36.68\pm 0.5$
Ours w/ LoRA	R56	CIFAR-100	$5,015$	$39.05\pm 0.3$

4.2 ImageNet-100

To further demonstrate MCNC’s effectiveness, we expanded our experiments to include the more challenging ImageNet-100 dataset [65]. We evaluated our method on larger architectures, such as ResNet-18, and extended our analysis to Vision Transformers by testing on DeiT-Ti [66], showing that our approach remains effective across these models. For all MCNC experiments, we let $k=16,d=1600,L=20$ and vary $m$ to achieve the desired number of parameters. Table 2 shows the results on ImageNet-100 with ResNet-18. In this experiment, we also varied the number of parameters (i.e., the compression rate) for our method to provide a more holistic comparison. We can see that at a similar compression rate (#parameters $\approx$ 100,000), MCNC outperforms PRANC and NOLA by a large margin. Similarly, Table 3 shows the results of MCNC on ImageNet-100 for the DeiT-Ti model.

Table 2: Comparison with PRANC and NOLA on ImageNet-100 using ResNet-18 with multiple compression rates. All results reported for our method use LoRA. The 9,600 BatchNorm parameters are excluded from all parameter counts. NOLA and our method with LoRA use rank 64. The accuracy saturation for a higher number of parameters could be due to training with a fixed number of epochs.

Method	# Parameters	Acc.
Baseline	11,218,212	82.1
PRANC [13]	100,000	61.08 [13]
PRANC [13]	200,000	67.28 [13]
LoRA [11]	140,400	63.50 [15]
NOLA [15]	100,000	64.66 [15]
Ours	50,150	$63.52\pm 0.4$
Ours	100,300	$70.09\pm 0.5$
Ours	200,175	$75.03\pm 0.2$
Ours	400,350	$77.02\pm 0.5$
Ours	800,275	$77.76\pm 0.5$
Ours	1,600,125	$77.83\pm 0.7$

Table 3: Comparison of our method with LoRA on DeiT-Ti. We include results for our method both with and without LoRA. We use rank 32 for our method when applicable.

Method	# Parameters	Acc.
Baseline	5,543,716	72.08
LoRA (rank=32)	2,855,076	64.96
LoRA (rank=1)	155,720	44.79
Ours w/o LoRA	106,538	$47.45\pm 0.0$
Ours w/o LoRA	165,460	$54.20\pm 0.6$
Ours w/ LoRA	106,992	$52.21\pm 0.4$
Ours w/ LoRA	166,560	$56.12\pm 0.2$

Table 4: Instruction fine-tuning for quantized LLaMA-2. We fine-tune LLaMA-2 with the Alpaca dataset and compare MCNC to NOLA, maintaining a similar number of parameters for both. "Model Reconstruction GFLOPs" refers to the FLOPs needed to generate the original parameters of the adapter on the fly. MCNC achieves comparable performance to NOLA while requiring fewer FLOPs for on-the-fly parameter generation. This efficiency results in faster inference and training for MCNC compared to NOLA. As shown in the table, MCNC achieves a $23\%$ increase in throughput for LLaMA-2 7B and a $35\%$ increase for LLaMA-2 13B. We use a single RTX 3090 GPU to measure the throughput.

Model	Method	Trainable Parameters	MMLU Acc.	Train Loss	Val Loss	Throughput (Samples/Sec)	Model Reconstruction GFLOPs
LLaMA-2 7B (4-bit)	NOLA [15]	28k	45.5	1.08	1.01	2.6	11.19
LLaMA-2 7B (4-bit)	MCNC	25k	45.8	1.07	1.01	3.2 ($\uparrow 23\%$)	2.75
LLaMA-2 13B (4-bit)	NOLA [15]	78k	54.8	1.00	0.96	1.7	17.52
LLaMA-2 13B (4-bit)	MCNC	77k	55.0	1.00	0.95	2.3 ($\uparrow 35\%$)	4.22

Table 5: The impact of MCNC design decisions on ImageNet-100 classification accuracy using a ResNet-18 with a 100k parameter budget.

Manifold	Low Rank	$m>1$	Acc.
$\times$	$\times$	✓	$61.08$ [13]
$\times$	✓	✓	$64.66$ [15]
✓	$\times$	$\times$	$63.06\pm 0.3$
✓	$\times$	✓	$61.53\pm 0.2$
✓	✓	$\times$	$70.18\pm 0.4$
✓	✓	✓	$70.09\pm 0.5$

Table 6: Impact of varying $k$ and $d$ with fixed compression rate on ImageNet-100 classification using a ResNet-18 with a 100k parameter budget.

Input Size ($k$)	Output Size ($d$)	Acc.
$1$	$100$	$30.35\pm 0.9$
$2$	$200$	$48.97\pm 0.2$
$4$	$400$	$68.18\pm 0.3$
$8$	$800$	$69.99\pm 0.6$
$16$	$1600$	$70.08\pm 0.5$
$32$	$3200$	$68.04\pm 0.3$

L	Acc.
$1$	$56.77\pm 1.4$
$5$	$66.23\pm 0.6$
$10$	$69.68\pm 0.5$
$20$	$69.65\pm 0.5$
$40$	$69.85\pm 0.5$
$50$	$58.91\pm 0.5$

4.3 MCNC for Fine-tuning Large Language Models

In this section, we compare the performance of MCNC and NOLA in parameter-efficient fine-tuning. We conducted fine-tuning experiments on two variants of the LLaMA-2 [4] language model, LLaMA-2 7B and LLaMA-2 13B, using the Alpaca dataset as our training data [67]. We report both training loss and validation loss on the Alpaca dataset. Additionally, we report MMLU (Massive Multitask Language Understanding) [68] 5-shot accuracy following the standard practice. This benchmark spans 57 tasks across diverse fields such as computer science, mathematics, law, and history.

Implementation Details: We use the training code from QLoRA [69] and NOLA [15] for our experiments. We quantize the original parameters of the language model to 4-bit, then add the adapter to all layers of the transformer and fine-tune it. For NOLA, we followed the hyperparameters reported in [15]. We set rank $=16$ for both NOLA and our method. In our method, we use a generator with the following specifications: a 3-layer MLP with an input dimension of 5, an output dimension of 5000, and a hidden dimension of 32. We use $m=1$ for our method. In NOLA, we use 64 bases for A and B in LLaMA-2 7B and 140 bases for LLaMA-2 13B. These numbers of bases result in a number of optimized parameters similar to that of our method for each architecture. All other hyperparameters were identical for both methods except the learning rate; we used $lr=0.001$ for NOLA and $0.05$ for ours.

Results: The results are presented in Table 4. MCNC has comparable performance to NOLA with a similar number of parameters. Importantly, however, MCNC requires $75\%$ fewer GFLOPs for generating parameters on the fly compared to NOLA. This efficiency translates into faster inference and training for MCNC compared to NOLA. As shown in Table 4, MCNC achieves a $23\%$ faster throughput for LLaMA 7B and a $35\%$ faster throughput for LLaMA 13B. While it is feasible to generate parameters offline and merge them with the original weights of the language model, this approach limits the applications when processing multiple tasks and their corresponding adapters in a batch. Therefore, in scenarios involving batch processing of tasks, MCNC holds an advantage over NOLA due to its faster throughput.
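The percentages quoted above follow directly from the Table 4 entries. The short check below recomputes them; the dictionary layout and key names are illustrative, while the numbers are copied from the table.

```python
# Values copied from Table 4 (GFLOPs for model reconstruction, samples/sec throughput).
table4 = {
    "LLaMA-2 7B": {"nola_gflops": 11.19, "mcnc_gflops": 2.75, "nola_tput": 2.6, "mcnc_tput": 3.2},
    "LLaMA-2 13B": {"nola_gflops": 17.52, "mcnc_gflops": 4.22, "nola_tput": 1.7, "mcnc_tput": 2.3},
}

summary = {}
for name, row in table4.items():
    flop_reduction = 100.0 * (1.0 - row["mcnc_gflops"] / row["nola_gflops"])
    throughput_gain = 100.0 * (row["mcnc_tput"] / row["nola_tput"] - 1.0)
    summary[name] = (flop_reduction, throughput_gain)
    print(f"{name}: {flop_reduction:.1f}% fewer GFLOPs, {throughput_gain:.1f}% higher throughput")
```

Both models show a reduction of a little over 75% in reconstruction GFLOPs, and the throughput gains round to 23% and 35%.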

4.4 Ablation Studies

Impact of MCNC design decisions: When applying MCNC to a model under a specific parameter budget, we can proceed in several ways. First, we can design a generator with $k$ and $d$ values that yield exactly the number of parameters needed for model reparameterization. Alternatively, we can increase the compression rate of the generator and adjust $m$ accordingly. A third option is to perform a low-rank decomposition of the model before compression, which allows for lower compression rates or higher $m$ values at the same parameter count. We explore the effects of these design choices by assessing the compression performance of ResNet-18 on the ImageNet-100 dataset with a parameter budget of approximately 100,000, selecting $k$, $d$, and $m$ to meet this target. Without our nonlinear manifold, this equates to using random bases, specifically NOLA and PRANC, with and without low-rank constraints, respectively. Table 6 shows that integrating the nonlinear manifold consistently improves performance across the settings tested. Moreover, increasing $m$ beyond 1 does not significantly affect performance, allowing MCNC to achieve diverse compression rates using a single generator without compromising performance. The most significant performance improvement comes from low-rank decomposition.
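The effect of the low-rank option on the required compression rate can be illustrated with simple arithmetic. The sketch below is an assumption-laden example, not a reproduction of the experimental setup: the layer shape, the rank, and the per-layer budget are hypothetical, and the helper names are introduced only for this illustration.

```python
def dense_params(rows, cols):
    """Parameter count of a full weight matrix."""
    return rows * cols

def low_rank_params(rows, cols, rank):
    """Parameter count of a rank-r factorization W = U V, U: rows x r, V: r x cols."""
    return rank * (rows + cols)

def required_rate(n_params, budget):
    """Compression rate the generator must provide to fit n_params into the budget."""
    return n_params / budget

rows, cols, rank, budget = 512, 4608, 16, 1000   # hypothetical layer and budget
full = dense_params(rows, cols)
factored = low_rank_params(rows, cols, rank)
print(full, required_rate(full, budget))
print(factored, required_rate(factored, budget))
```

For this hypothetical layer, factorizing first reduces the rate the generator must supply from roughly 2359x to roughly 82x at the same budget, which is the slack that can instead be spent on a lower generator compression rate or a larger $m$.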

Impact of varying $k$ and $d$ with fixed compression rate: The ratio of $k$ to $d$ determines the generator's compression rate. However, selecting appropriate values for $k$ and $d$ under a fixed compression rate presents challenges. To investigate this, we trained multiple generators, each with a compression rate of 100, and applied them to ImageNet-100 using ResNet-18, with $L$ fixed at 10. The results are presented in Table 6. A $k$ value close to 1 leads to poor performance, possibly due to reduced degrees of freedom when traversing the manifold, which complicates optimization. Increasing $k$ and $d$ generally improves performance up to a saturation point, after which it begins to decline. This decline is likely because, as $d$ becomes very large, the generator struggles to cover the space effectively.
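As a worked illustration of why $k$ and $d$ can be varied at a fixed compression rate, the sketch below computes the generator compression rate $d/k$ and the resulting number of trainable generator inputs for several $(k, d)$ pairs. The pairs and the model size of 11,000,000 weights are hypothetical round numbers chosen for the example, and the count ignores any additional per-chunk parameters and the effect of $m$.

```python
import math

def generator_compression_rate(k, d):
    """Each k-dimensional input is expanded into d weights."""
    return d / k

def trainable_inputs(n_weights, k, d):
    """Number of generator inputs needed to cover n_weights in chunks of size d."""
    return math.ceil(n_weights / d) * k

n_weights = 11_000_000                                  # hypothetical model size
candidates = [(1, 100), (5, 500), (10, 1000), (50, 5000)]  # hypothetical (k, d) pairs
rates = [generator_compression_rate(k, d) for k, d in candidates]
budgets = [trainable_inputs(n_weights, k, d) for k, d in candidates]
for (k, d), rate, budget in zip(candidates, rates, budgets):
    print(f"k={k:>2} d={d:>4} rate={rate:.0f} trainable={budget}")
```

All four settings give the same rate and the same budget, so any difference in accuracy between them would come from the geometry of the generated manifold rather than from the parameter count.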

Effect of choice of $L$ when training downstream tasks: As noted in Section 3.1, the choice of $L$ controls the frequency of twists in the manifold. In addition to affecting the coverage of the high-dimensional space, this choice can directly affect the ease of optimization when weights are constrained to lie on this manifold. For this reason, we train multiple generators with varying values of $L$ and use each to train a ResNet-20 on CIFAR-100 while fixing $k=5$, $d=500$. We show our results in Table 7. When $L$ becomes too small or too large, MCNC's accuracy drops significantly. When $L$ is small, this is likely due to insufficient coverage of the high-dimensional space. When $L$ is large, however, the number of turns in the manifold increases, likely making optimization more difficult and reducing performance.

Effect of increasing $k$ to decrease compression rate: MCNC's compression is determined by both the generator compression rate and the selection of $m$. We study how performance improves as we decrease the compression rate of the generator. To do so, we fix $m=1$, $d=1600$ and vary $k$. We then use each generator to train ResNet-20 with low-rank decomposition on CIFAR-100 and show the results in Figure 3. Performance improves steadily as we decrease the compression rate of the generator. Notably, we almost entirely recover the original performance of the model while using a fraction of the total number of parameters.

5 Conclusion and Limitations

Leveraging the prevalence of viable solutions in training overparameterized deep models, we have developed a nonlinear reparameterization method aimed at exploring local minima within a low-dimensional manifold of the parameter space. This approach utilizes a generator model that maps a low-dimensional uniform distribution to a high-dimensional uniform distribution, effectively wrapping the low-dimensional input subspace around the high-dimensional hypersphere. Through extensive experiments in computer vision and NLP tasks, we demonstrate that our method, MCNC, significantly outperforms state-of-the-art baselines in terms of accuracy and/or model reconstruction time. We believe this technique can enhance the storage and deployment of large models, which are integral to recent advancements in AI, thus facilitating more democratized access to AI tools. Such access may also have negative impacts by putting AI in the hands of many non-sophisticated adversaries.

Limitations: We have not fully explored the optimal generator model configuration, including its architecture, loss function, and types of regularizers, and their effects on the downstream optimization. Moreover, due to computational constraints, we have not shown the effect of MCNC in training large language models from scratch, which may be the most important impact of MCNC.

Acknowledgments

This work was partially supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR00112290115 and HR00112190135 and NSF awards 1845216 and 2339898.

References

Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=YicbFdNTTy.
Zhai et al. [2022] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12104–12113, 2022.
Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Zhang et al. [2023a] Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, et al. Google USM: Scaling automatic speech recognition beyond 100 languages. arXiv preprint arXiv:2303.01037, 2023a.
Baevski et al. [2020] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33:12449–12460, 2020.
Grechishnikova [2021] Daria Grechishnikova. Transformer neural network for protein-specific de novo drug generation as a machine translation problem. Scientific reports, 11(1):321, 2021.
Jumper et al. [2021] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589, 2021.
Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
Ding et al. [2023] Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, et al. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence, 5(3):220–235, 2023.
Hu et al. [2022] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9.
Tang et al. [2024] Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, and Dacheng Tao. A survey on transformer compression. arXiv preprint arXiv:2402.05964, 2024.
Nooralinejad et al. [2023] Parsa Nooralinejad, Ali Abbasi, Soroush Abbasi Koohpayegani, Kossar Pourahmadi Meibodi, Rana Muhammad Shahroz Khan, Soheil Kolouri, and Hamed Pirsiavash. PRANC: Pseudo random networks for compacting deep models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17021–17031, 2023.
Girish et al. [2023] Sharath Girish, Kamal Gupta, Saurabh Singh, and Abhinav Shrivastava. Lilnetx: Lightweight networks with EXtreme model compression and structured sparsification. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=NVZvalzCLg.
Koohpayegani et al. [2024] Soroush Abbasi Koohpayegani, Navaneet K L, Parsa Nooralinejad, Soheil Kolouri, and Hamed Pirsiavash. NOLA: Compressing LoRA using linear combination of random basis. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=TjfXcDgvzk.
Dettmers et al. [2024] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36, 2024.
Liu et al. [2022] Chaoyue Liu, Libin Zhu, and Mikhail Belkin. Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. Applied and Computational Harmonic Analysis, 59:85–116, 2022.
Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
Dehghani et al. [2019] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HyzdRiR9Y7.
Nowlan and Hinton [1992] Steven J Nowlan and Geoffrey E Hinton. Simplifying neural networks by soft weight-sharing. Neural Computation, 4(4):473–493, 1992.
Ullrich et al. [2017] Karen Ullrich, Edward Meeds, and Max Welling. Soft weight-sharing for neural network compression. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=HJGwcKclx.
Chen et al. [2015] Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. In International conference on machine learning, pages 2285–2294. PMLR, 2015.
Gao et al. [2019] Shangqian Gao, Cheng Deng, and Heng Huang. Cross domain model compression by structurally weight sharing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8973–8982, 2019.
Plummer et al. [2022] Bryan A. Plummer, Nikoli Dryden, Julius Frost, Torsten Hoefler, and Kate Saenko. Neural parameter allocation search. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=srtIXtySfT4.
Shakerinava et al. [2024] Mehran Shakerinava, Motahareh MS Sohrabi, Siamak Ravanbakhsh, and Simon Lacoste-Julien. Weight-sharing regularization. In International Conference on Artificial Intelligence and Statistics, pages 4204–4212. PMLR, 2024.
Rastegari et al. [2016] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European conference on computer vision, pages 525–542. Springer, 2016.
Lee et al. [2021] Junghyup Lee, Dohyung Kim, and Bumsub Ham. Network quantization with element-wise gradient scaling. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6448–6457, 2021.
Xiao et al. [2023] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 38087–38099. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/xiao23c.html.
Shkolnik et al. [2020] Moran Shkolnik, Brian Chmiel, Ron Banner, Gil Shomron, Yury Nahshan, Alex Bronstein, and Uri Weiser. Robust quantization: One model to rule them all. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 5308–5317. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/3948ead63a9f2944218de038d8934305-Paper.pdf.
Yao et al. [2023] Zhewei Yao, Cheng Li, Xiaoxia Wu, Stephen Youn, and Yuxiong He. A comprehensive study on post-training quantization for large language models. arXiv preprint arXiv:2303.08302, 2023.
Kim et al. [2020] Woojeong Kim, Suhyun Kim, Mincheol Park, and Geunseok Jeon. Neuron merging: Compensating for pruned neurons. Advances in Neural Information Processing Systems, 33:585–595, 2020.
Lin et al. [2020] Tao Lin, Sebastian U. Stich, Luis Barba, Daniil Dmitriev, and Martin Jaggi. Dynamic model pruning with feedback. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SJem8lSFwB.
Siems et al. [2021] Julien Niklas Siems, Aaron Klein, Cedric Archambeau, and Maren Mahsereci. Dynamic pruning of a neural network via gradient signal-to-noise ratio. In 8th ICML Workshop on Automated Machine Learning (AutoML), 2021.
Tiwari et al. [2021] Rishabh Tiwari, Udbhav Bamba, Arnav Chavan, and Deepak K Gupta. Chipnet: Budget-aware pruning with heaviside continuous approximations. arXiv preprint arXiv:2102.07156, 2021.
Hayou et al. [2020] Soufiane Hayou, Jean-Francois Ton, Arnaud Doucet, and Yee Whye Teh. Robust pruning at initialization. arXiv preprint arXiv:2002.08797, 2020.
Wang et al. [2020a] Huan Wang, Can Qin, Yulun Zhang, and Yun Fu. Neural pruning via growing regularization. arXiv preprint arXiv:2012.09243, 2020a.
Li et al. [2021] Yuchao Li, Shaohui Lin, Jianzhuang Liu, Qixiang Ye, Mengdi Wang, Fei Chao, Fan Yang, Jincheng Ma, Qi Tian, and Rongrong Ji. Towards compact cnns via collaborative compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6438–6447, 2021.
Kusupati et al. [2020a] Aditya Kusupati, Vivek Ramanujan, Raghav Somani, Mitchell Wortsman, Prateek Jain, Sham Kakade, and Ali Farhadi. Soft threshold weight reparameterization for learnable sparsity. In Proceedings of the International Conference on Machine Learning, July 2020a.
Isik et al. [2022] Berivan Isik, Tsachy Weissman, and Albert No. An information-theoretic justification for model pruning. In International Conference on Artificial Intelligence and Statistics, pages 3821–3846. PMLR, 2022.
Han et al. [2015] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
Havasi et al. [2018] Marton Havasi, Robert Peharz, and José Miguel Hernández-Lobato. Minimal random code learning: Getting bits back from compressed model parameters. arXiv preprint arXiv:1810.00440, 2018.
Reagan et al. [2018] Brandon Reagan, Udit Gupta, Bob Adolf, Michael Mitzenmacher, Alexander Rush, Gu-Yeon Wei, and David Brooks. Weightless: Lossy weight encoding for deep neural network compression. In International Conference on Machine Learning, pages 4324–4333. PMLR, 2018.
Valipour et al. [2023] Mojtaba Valipour, Mehdi Rezagholizadeh, Ivan Kobyzev, and Ali Ghodsi. Dylora: Parameter-efficient tuning of pre-trained models using dynamic search-free low-rank adaptation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 3274–3287, 2023.
Zhang et al. [2023b] Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. In The Eleventh International Conference on Learning Representations, 2023b. URL https://openreview.net/forum?id=lq62uWRJjiY.
Li et al. [2018] Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ryup8-WCW.
Fei et al. [2023] Yanhong Fei, Xian Wei, Yingjie Liu, Zhengyu Li, and Mingsong Chen. A survey of geometric optimization for deep learning: From euclidean space to riemannian manifold. arXiv preprint arXiv:2302.08210, 2023.
Ozay and Okatani [2016] Mete Ozay and Takayuki Okatani. Optimization on submanifolds of convolution kernels in cnns. arXiv preprint arXiv:1610.07008, 2016.
Huang et al. [2018] Lei Huang, Xianglong Liu, Bo Lang, Adams Yu, Yongliang Wang, and Bo Li. Orthogonal weight normalization: Solution to optimization over multiple dependent stiefel manifolds in deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
Wang et al. [2020b] Jiayun Wang, Yubei Chen, Rudrasis Chakraborty, and Stella X Yu. Orthogonal convolutional neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11505–11515, 2020b.
Absil et al. [2008] P-A Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2008.
Boumal [2023] Nicolas Boumal. An introduction to optimization on smooth manifolds. Cambridge University Press, 2023.
Meghwanshi et al. [2018] Mayank Meghwanshi, Pratik Jawanpuria, Anoop Kunchukuttan, Hiroyuki Kasai, and Bamdev Mishra. Mctorch, a manifold optimization library for deep learning. arXiv preprint arXiv:1810.01811, 2018.
Kochurov et al. [2020] Max Kochurov, Rasul Karimov, and Serge Kozlukov. Geoopt: Riemannian optimization in pytorch. arXiv preprint arXiv:2005.02819, 2020.
Luchnikov et al. [2021] Ilia Luchnikov, Alexander Ryzhov, Sergey Filippov, and Henni Ouerdane. Qgopt: Riemannian optimization for quantum technologies. SciPost Physics, 10(3):079, 2021.
Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
Arjovsky et al. [2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223. PMLR, 2017.
Deshpande et al. [2018] Ishan Deshpande, Ziyu Zhang, and Alexander G Schwing. Generative modeling using the sliced wasserstein distance. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3483–3491, 2018.
Kolouri et al. [2019] Soheil Kolouri, Kimia Nadjahi, Umut Simsekli, Roland Badeau, and Gustavo Rohde. Generalized sliced wasserstein distances. Advances in neural information processing systems, 32, 2019.
Sitzmann et al. [2020] Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. Advances in neural information processing systems, 33:7462–7473, 2020.
Tancik et al. [2020] Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. Advances in neural information processing systems, 33:7537–7547, 2020.
Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.
Kusupati et al. [2020b] Aditya Kusupati, Vivek Ramanujan, Raghav Somani, Mitchell Wortsman, Prateek Jain, Sham Kakade, and Ali Farhadi. Soft threshold weight reparameterization for learnable sparsity. In International Conference on Machine Learning, pages 5544–5555. PMLR, 2020b.
Tian et al. [2020] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI, page 776–794, Berlin, Heidelberg, 2020. Springer-Verlag. ISBN 978-3-030-58620-1. doi: 10.1007/978-3-030-58621-8_45. URL https://doi.org/10.1007/978-3-030-58621-8_45.
Touvron et al. [2021] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 10347–10357. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/touvron21a.html.
Taori et al. [2023] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
Hendrycks et al. [2021] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.
Dettmers et al. [2023] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms, 2023.