MCNC: Manifold Constrained Network Compression

Chayne Thrash1, Ali Abbasi1, Parsa Nooralinejad2, Soroush Abbasi Koohpayegani2, Reed Andreas1, Hamed Pirsiavash2, Soheil Kolouri1

1 Department of Computer Science, Vanderbilt University, Nashville, TN
2 Department of Computer Science, University of California, Davis, CA

Abstract

The outstanding performance of large foundational models across diverse tasks—from computer vision to speech and natural language processing—has significantly increased their demand. However, storing and transmitting these models pose significant challenges due to their massive size (e.g., 350GB for GPT-3). Recent literature has focused on compressing the original weights or reducing the number of parameters required for fine-tuning these models. These compression methods typically involve constraining the parameter space, for example, through low-rank reparametrization (e.g., LoRA) or quantization (e.g., QLoRA) during model training. In this paper, we present MCNC as a novel model compression method that constrains the parameter space to low-dimensional pre-defined and frozen nonlinear manifolds, which effectively cover this space. Given the prevalence of good solutions in over-parameterized deep neural networks, we show that by constraining the parameter space to our proposed manifold, we can identify high-quality solutions while achieving unprecedented compression rates across a wide variety of tasks. Through extensive experiments in computer vision and natural language processing tasks, we demonstrate that our method, MCNC, significantly outperforms state-of-the-art baselines in terms of compression, accuracy, and/or model reconstruction time.

1 Introduction

Recent state-of-the-art performance in computer vision [1, 2], natural language processing [3, 4], speech [5, 6], and other fields [7, 8] is often achieved through large transformer networks [9] with hundreds of millions to tens of billions of parameters, trained on vast amounts of data. For instance, the GPT-3 model [3], with its 175 billion parameters, requires around 350GB of memory for storage when using float16 precision. As the size of these models continues to increase and the field moves towards large foundational models, the need for effective compression methods that shrink the model size without significant loss of performance is ever-increasing. Model compression has become essential for efficient storage, communication, and potential deployment on edge devices.

Similarly, fine-tuning foundation models to custom tasks and datasets has become the gold standard for obtaining high-performing, customized models. However, besides the storage concerns for such customized models, the overhead involved in loading task-specific models and transferring weights from CPU to GPU often becomes a computational bottleneck in many applications. These concerns have led to the development of Parameter Efficient Fine-Tuning (PEFT) [10] approaches, which have become very popular in recent years. These methods focus on efficiently compressing the residual weights/parameters to adapt the foundational model to specific tasks. For instance, enforcing the residual weights to be low-rank, as proposed in LoRA [11], would significantly reduce the number of trainable parameters for fine-tuning the model.

A wide variety of methods have been proposed for the compression of large models [12]. These approaches can be roughly categorized into five principal techniques: weight sharing, quantization, pruning, knowledge distillation, and reparameterization. Related to our method, reparameterization methods [11, 13, 14, 15] focus on reducing the number of parameters or simplifying computations by restructuring or reparameterizing model weights while maintaining performance. Notably, many recent papers focus on hybrid approaches that leverage multiple principal compression techniques like quantization and reparameterization, such as QLoRA [16]. In this paper, we propose a novel nonlinear reparameterization technique for compressing the weights of large models.

Our work in this paper is inspired by the recent reparameterization methods PRANC [13] and NOLA [15]. At their core, these reparameterization techniques leverage the abundance of good solutions for overparameterized neural networks [17] with $d$ parameters and constrain the search for optimal solutions to random $k$-dimensional subspaces within the parameter space, where $k \ll d$. Here, we pose the following question: Could we find better solutions by increasing the complexity of our search space from a random subspace [13, 15] to a $k$-dimensional nonlinear manifold that efficiently covers the parameter space? Below, we further clarify the main concept in our paper through an intuitive thought experiment.

Figure 1: Consider a model or a module in our model parameterized by $\theta \in \mathbb{R}^d$. We use polar decomposition to represent the model parameters as $\theta = \beta u$, where $\beta \in \mathbb{R}$ is the amplitude, $u \in \mathbb{S}^{d-1}$ is the unit vector denoting the direction, and $\mathbb{S}^{d-1}$ is the unit hypersphere in $\mathbb{R}^d$. The core concept in our proposed work is to wrap a bounded $k$-dimensional subspace, $[-L, L]^k$, around the hypersphere using a nonlinear generator $\phi: [-L, L]^k \to \mathbb{S}^{d-1}$, and reparameterize the weights as $\theta = \beta \phi(\alpha)$, where $\alpha \in [-L, L]^k$ (Panel (a)). Additionally, we can increase our modeling flexibility by constraining $\theta$ to live in the span of $\{\phi(\alpha_i)\}_{i=1}^{m}$, i.e., $\theta = \sum_{i=1}^{m} \beta_i \phi(\alpha_i)$, as seen in Panel (b). To train the model/module, we optimize not in the $d$-dimensional space, but with respect to $\{(\alpha_i, \beta_i)\}_{i=1}^{m}$, reducing the number of parameters from $d$ to $m(k+1)$ (Panel (c)). Lastly, the manifold-constrained weights are obtained via $\theta^{*}_{\mathcal{M}} = \sum_{i=1}^{m} \beta^{*}_{i} \phi(\alpha^{*}_{i})$, as shown in Panel (d).

Winding a string around a sphere. Suppose we want to optimize a model with parameters $\theta \in \mathbb{R}^d$. Using polar decomposition, we can express $\theta$ as $\theta = \beta u$, where $\beta \in \mathbb{R}$ is the amplitude, and $u = \theta / \|\theta\| \in \mathbb{S}^{d-1}$ is a unit vector in the direction of $\theta$. Here, $\mathbb{S}^{d-1}$ denotes the unit hypersphere in $\mathbb{R}^d$. Now, imagine taking a segment of the real line, $[-L, L]$ (think of a string of length $2L$), and wrapping it around the hypersphere $\mathbb{S}^{d-1}$ to maximize coverage, thus parametrizing a one-dimensional manifold with $\alpha \in [-L, L]$ on $\mathbb{S}^{d-1}$. Instead of optimizing directly in the $d$-dimensional space, we can now optimize over the amplitude $\beta \in \mathbb{R}$ and the manifold parametrization $\alpha \in [-L, L]$. Figure 1 (a) demonstrates this concept for $d = 3$. Taking this idea one step further, we can wrap a $k$-dimensional space, $[-L, L]^k$, around the hypersphere instead of just a one-dimensional segment. Increasing $k$ is expected to increase the coverage of the hypersphere. This reparameterization reduces the number of parameters from $d$ to $k + 1$: we now optimize $\alpha \in [-L, L]^k$ and $\beta \in \mathbb{R}$.
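To make the thought experiment concrete, the following minimal sketch winds a one-dimensional segment around $\mathbb{S}^2$ ($d = 3$, $k = 1$) with a hand-written spiral. The spiral itself, the function name `spiral_on_sphere`, and the `turns` parameter are illustrative assumptions; in MCNC the map $\phi$ is learned and has no such closed form.

```python
import numpy as np

def spiral_on_sphere(alpha, L=1.0, turns=10):
    """Hand-crafted stand-in for phi: maps alpha in [-L, L] to a point on S^2.

    Assumption: this closed-form spiral only illustrates the idea; it is not
    the learned generator described in the paper.
    """
    alpha = np.asarray(alpha, dtype=float)
    z = np.clip(alpha / L, -1.0, 1.0)      # height runs from -1 to 1 along the string
    r = np.sqrt(1.0 - z**2)                # radius of the horizontal circle at height z
    angle = 2.0 * np.pi * turns * z        # more turns give a denser winding
    return np.stack([r * np.cos(angle), r * np.sin(angle), z], axis=-1)

# A 3-dimensional weight vector described by only two numbers, (alpha, beta).
alpha, beta = 0.3, 2.5
theta = beta * spiral_on_sphere(alpha)
```

Here `theta` has norm $|\beta|$ and its direction is fully determined by the single scalar `alpha`, which is the $k + 1 = 2$ parameter description discussed above.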

We first train an MLP generator network that covers the hypersphere from a lower-dimensional input space. Then, we reparameterize our deep model by constructing its weight matrices from several evaluations of the generator network. Finally, instead of optimizing the parameters of the deep model, we optimize the low-dimensional inputs to the generator network that reconstruct the model parameters. We show that our method outperforms several SOTA baselines, including the linear random-subspace methods PRANC [13] and NOLA [15], on various benchmarks.

Contributions. Our specific contributions in this work are:

  1. Introducing a novel nonlinear reparameterization technique for model compression, termed Manifold Constrained Network Compression (MCNC).

  2. Utilizing generative modeling techniques to explicitly parameterize a low-dimensional manifold in a high-dimensional space and constraining model training to this manifold.

  3. Demonstrating the effectiveness of MCNC compared to recent network compression methods in the literature across vision and natural language processing tasks and across diverse architectures.

2 Related Work

Our goal in this paper is to train a deep model with a minimal number of parameters, making it more efficient to transfer between agents or store on devices with limited memory. This motivation is similar to that of recent reparameterization methods [13, 15]. Notably, our compact representation assumes no change in the model size, weight sharing, the number of non-zero parameters, or the precision of computation. Hence, our contributions are orthogonal to approaches like weight sharing, quantization, and knowledge distillation. Hybrid approaches could be developed by combining the proposed Manifold Constrained Network Compression (MCNC) with these techniques. Here, we provide a brief overview of these techniques, which reduce the model’s size on disk and could improve inference time, typically without significantly affecting the model’s performance.

Weight sharing can be enforced through the model architecture, such as with convolutional kernels [18], recurrent networks [19], or other hybrid architectures [20]. It can also be achieved post-training via clustering mechanisms [21, 22] or hashing [23] to bundle the weights of a model into a few shared components. Other notable weight sharing methods proposed in recent years include [24, 25, 26]. Our proposed method, MCNC, introduces a novel low-dimensional reparameterization of model weights, which can be combined with weight-sharing methods. For instance, different layers of the network could share some of their low-dimensional parameters to further improve compression.

Quantization is one of the most prevalent methods of model compression that focuses on reducing the number of bits required to represent model weights and/or activations (W/A). For instance, Rastegari et al. [27] proposed XNOR-Networks, where binary operations are used in a network with XNOR gates to approximate convolutions, leading to a reported 58 times speedup and 32 times memory savings. The core challenge with these techniques is that naive quantization of models often results in a significant performance drop. On the other hand, quantized training/fine-tuning is often formulated as a constrained optimization problem that requires specialized optimization techniques [28]. A large body of recent work focuses on effective quantization methods for large models [29, 30, 31, 16]. Importantly, even in the extreme case of reducing float32 weights to binary, a quantization method achieves at most a $32\times$ compression (compared to a non-quantized baseline). In this paper, we aim for much larger compression rates. Nevertheless, our proposed method remains compatible with quantized training methods, where the bit-precision of our proposed low-dimensional parameters is reduced.

Pruning is a primary technique for neural network compression, as highlighted in several studies [32, 33, 34, 35, 36, 37, 38, 27, 28]. High pruning rates, resulting in substantial compression, can be achieved through methods detailed in [39, 40]. Intelligent coding strategies, such as applying Huffman coding on top of pruning and quantization, have been shown to further enhance compression efficiency [41]. Recent approaches, such as MIRACLE [42] and Weightless [43], have demonstrated significantly higher compression rates, achieving improvements of over 400 times. Our approach differs in that we compress models by reparameterizing them. Such methods might further enhance our compression rate by pruning the reparameterized space.

Compression by reparameterization has recently gained much attention from the community, and our technique falls under this category. In their seminal work, Hu et al. [11] introduced LoRA, which leverages a low-rank reparametrization of the model weights for fine-tuning large models. Many recent extensions of LoRA have also been proposed [44, 45, 16, 15]. More relevant to our proposed approach is PRANC [13], which constrains the model weights to a randomly generated low-dimensional subspace in the model’s parameter space. Notably, NOLA [15] leverages the idea of PRANC to push Parameter-Efficient Fine-Tuning (PEFT) with LoRA below the limit of rank one. PRANC and NOLA inherently rely on the prevalence of good solutions in the parameter space of overparameterized models [17]. Assuming that the ‘good solutions’ in the parameter space are located on a low-dimensional nonlinear manifold [46, 17], there’s a reasonable chance that a randomly drawn low-dimensional subspace within the parameter space will intersect this manifold. This intersection could enable us to find acceptable solutions within our chosen subspace. The core idea behind our paper stems from this question: Can we (nonlinearly) manipulate this k-dimensional subspace in the parameter space such that it better covers the space, thereby increasing the chance of intersection with the manifold of good solutions? Hence, our method can be thought of as a nonlinear extension of PRANC [13] and NOLA [15]. Moreover, we demonstrate that our method, MCNC, is faster than PRANC and NOLA in model reconstruction time (e.g., in mixed mini-batch settings) due to its GPU compatibility and lower FLOPs.

Manifold constrained optimization has been employed in various works to constrain model parameters to a Riemannian manifold in order to induce various geometric characteristics [47]. For instance, some methods require that the model parameters reside on or be close to the Stiefel manifold, i.e., the set of $k$ orthonormal vectors in $\mathbb{R}^d$, to reduce redundant representations [48, 49, 50]. Manifold-constrained optimization [51, 52] in neural networks is gaining significant attention, as evidenced by recent developments in geometric optimization libraries for deep learning [53, 54, 55]. In this work, by learning a generative model, we obtain an explicit parameterization of a low-dimensional manifold that traverses a higher-dimensional hypersphere, and we constrain our optimization in the higher-dimensional space to this low-dimensional manifold. Thanks to this explicit parameterization of the manifold, we emphasize that our optimization process relies solely on standard (stochastic) gradient descent and not on Riemannian optimization.

3 Method

At its core, our Manifold Constrained Network Compression (MCNC) has two main ingredients: 1) mapping the uniform distribution in $[-L, L]^k$ to the uniform distribution on the hypersphere $\mathbb{S}^{d-1}$, where $k \ll d$, and 2) learning the deep model by constraining its parameters to be generated by the above map. This process results in a compression rate of almost $d/k$.

Notations and Preliminaries. Let $\theta \in \mathbb{R}^d$ denote a subset of the model parameters, which we aim to optimize. Through polar decomposition, we express $\theta$ by its amplitude $\beta = \|\theta\|$ and its direction $u = \theta / \|\theta\|$. Additionally, we denote the unit hypersphere in $\mathbb{R}^d$ by $\mathbb{S}^{d-1}$. Our goal is to reparameterize $u \in \mathbb{S}^{d-1}$ using significantly fewer parameters than $d$. To achieve this, we will employ a ‘generator’ model to explicitly parameterize a $k$-dimensional manifold that best represents the hypersphere.
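As a small illustration of the notation, the polar decomposition of a parameter vector can be sketched as follows. The helper name `polar_decompose` and the example vector are assumptions made for this illustration only.

```python
import numpy as np

def polar_decompose(theta):
    """Split a nonzero vector into amplitude beta = ||theta|| and direction u = theta / ||theta||."""
    theta = np.asarray(theta, dtype=float)
    beta = np.linalg.norm(theta)
    if beta == 0.0:
        raise ValueError("the direction is undefined for the zero vector")
    return beta, theta / beta

beta, u = polar_decompose([3.0, 0.0, 4.0])   # amplitude 5, direction (0.6, 0, 0.8)
```

The direction `u` is the only part that lives on the hypersphere, so it is the part that the generator is asked to represent with fewer parameters.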

3.1 Traversing a High-Dimensional Hypersphere by a Low-Dimensional Manifold

We aim to cover the hypersphere $\mathbb{S}^{d-1}$ using a $k$-dimensional manifold. For clarity, let us examine this concept in a simpler scenario: consider $d = 3$, corresponding to a sphere, and $k = 1$, akin to a string. This familiar setting helps illustrate our objective. Given a string of a specific length, how can we most effectively cover the surface of the sphere? The answer is intuitive: simply wrap the string around the sphere. This wrapping acts as a nonlinear operator that takes a straight line and deforms it around the sphere. Next, we will formalize this concept mathematically.

Let $\mathcal{U}([-L, L]^k)$ represent the uniform distribution over the $k$-dimensional hypercube $[-L, L]^k$, and $\mathcal{U}(\mathbb{S}^{d-1})$ represent the uniform distribution over the unit hypersphere $\mathbb{S}^{d-1} \subset \mathbb{R}^d$. We are interested in developing a nonlinear mapping, $\phi$, that converts $\mathcal{U}([-L, L]^k)$ into $\mathcal{U}(\mathbb{S}^{d-1})$. Specifically, this involves creating a function $\phi$ that takes samples from the hypercube, $\alpha \sim \mathcal{U}([-L, L]^k)$, and maps them to the hypersphere such that $\phi(\alpha) \sim \mathcal{U}(\mathbb{S}^{d-1})$. This transformation mirrors the challenge of deep generative modeling, a subject thoroughly investigated in the literature [56, 57, 58]. We emphasize that our decision to use the uniform distribution on $\mathbb{S}^{d-1}$ as our target distribution stems from an assumption of no prior knowledge about the importance of different directions for the downstream task of optimizing a network. Should such information become available, the target distribution could be adjusted to reflect this knowledge, allowing for more precise wrapping around areas of greater importance.
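Both distributions are easy to sample. The sketch below draws a source batch from the hypercube and a target batch from the hypersphere, using the standard fact that a normalized standard Gaussian vector is uniformly distributed on the sphere. The function names are hypothetical, and the batch size and dimensions are example values.

```python
import numpy as np

def sample_hypercube(n, k, L, rng):
    """Draw n points uniformly from [-L, L]^k (the generator's input distribution)."""
    return rng.uniform(-L, L, size=(n, k))

def sample_hypersphere(n, d, rng):
    """Draw n points uniformly from S^{d-1} by normalizing standard Gaussian vectors."""
    g = rng.standard_normal((n, d))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

rng = np.random.default_rng(0)
alphas = sample_hypercube(4096, k=16, L=20.0, rng=rng)    # source batch
targets = sample_hypersphere(4096, d=1600, rng=rng)       # target batch
```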

While there are many options for learning a generative model $\phi$, in this paper, we chose to use the SWGAN [58] for its simplicity. Briefly, each training iteration begins by sampling a batch from the source distribution, $\{\alpha_i \sim \mathcal{U}([-L, L]^k)\}_{i=1}^{B}$, and a batch from the target distribution, $\{u_j \sim \mathcal{U}(\mathbb{S}^{d-1})\}_{j=1}^{B}$, where $B = 4096$ in all our experiments. We then minimize the generalized sliced Wasserstein distance [59] between $\{\phi(\alpha_i)\}_{i=1}^{B}$ and $\{u_j\}_{j=1}^{B}$ with respect to $\phi$.
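For intuition about the training objective, here is a minimal sketch of a sliced Wasserstein estimate between two equally sized batches. It uses plain random linear projections, which is a simplification: the generalized variant cited above replaces them with nonlinear projections. The function names and the number of projections are assumptions.

```python
import numpy as np

def sliced_wasserstein_sq(x, y, n_proj=128, rng=None):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance.

    x, y: arrays of shape (B, d) holding two equally sized sample batches.
    Assumption: random linear projections (the plain, non-generalized variant).
    """
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[1]
    dirs = rng.standard_normal((d, n_proj))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)   # unit projection directions
    px = np.sort(x @ dirs, axis=0)                         # sorted 1-D projections of x
    py = np.sort(y @ dirs, axis=0)                         # sorted 1-D projections of y
    return float(np.mean((px - py) ** 2))                  # 1-D transport matches sorted samples

def unit_rows(n, d, rng):
    """Uniform samples on S^{d-1}, used here as example batches."""
    g = rng.standard_normal((n, d))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

rng = np.random.default_rng(0)
u = unit_rows(512, 8, rng)
v = unit_rows(512, 8, rng)
same = sliced_wasserstein_sq(u, v, rng=rng)            # batches from the same distribution
shifted = sliced_wasserstein_sq(u, v + 1.0, rng=rng)   # a shifted batch is farther away
```

In a training loop, `x` would be the generator outputs $\{\phi(\alpha_i)\}$ and `y` the target samples $\{u_j\}$, with the distance minimized over the generator weights by a differentiable implementation.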

For the network $\phi$, we selected a simple multi-layer perceptron (MLP) with $D$ hidden layers and a fixed size for the hidden layers, with $k$ inputs and $d$ outputs. Each layer of the MLP consists of a linear layer followed by a nonlinearity/activation, i.e., $x^{t} = \sigma(W^{t} x^{t-1})$, where $x^{t}$ is the model’s output at layer $t$, $x^{0} = \alpha$ is the input, and $\sigma(\cdot)$ is an elementwise nonlinearity. Figure 2 illustrates the learned 1-dimensional manifolds traversing $\mathbb{S}^{2}$, when $\phi$ is an MLP with 3 hidden layers of size 256, featuring LeakyReLU and Sine functions as the nonlinearity, $\sigma$, trained via SWGAN. Throughout the rest of this paper, we will use Sinusoidal activations [60] as our default activation. Notably, the parameter $L$ (i.e., the input bound) is analogous to the maximum frequency of Fourier features [61] and controls the frequency of twists as seen in Figure 2.
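The forward pass of such a generator can be sketched as below. Several details are assumptions that the text does not specify: the random Gaussian initialization, the linear output layer, and the explicit normalization of the output onto the unit sphere. The class name `SineMLPGenerator` is hypothetical, and the weights here are untrained, so the resulting curve does not yet cover the sphere uniformly.

```python
import numpy as np

class SineMLPGenerator:
    """Bias-free MLP with hidden layers x^t = sin(W^t x^{t-1}).

    Assumptions (not specified in the text): Gaussian initialization, a linear
    output layer, and projection of the output onto the unit sphere.
    """

    def __init__(self, k, d, hidden=256, depth=3, seed=0):
        rng = np.random.default_rng(seed)
        sizes = [k] + [hidden] * depth + [d]
        self.weights = [rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)
                        for n_in, n_out in zip(sizes[:-1], sizes[1:])]

    def __call__(self, alpha):
        x = np.atleast_2d(np.asarray(alpha, dtype=float))   # shape (batch, k)
        for W in self.weights[:-1]:
            x = np.sin(x @ W.T)                              # hidden layer with sine activation
        out = x @ self.weights[-1].T                         # linear output layer, shape (batch, d)
        # Note: without biases, alpha = 0 maps to the zero vector, which cannot be normalized.
        return out / np.linalg.norm(out, axis=1, keepdims=True)

phi = SineMLPGenerator(k=1, d=3)
curve = phi(np.linspace(-20.0, 20.0, 500)[:, None])   # 500 points of a 1-D curve on S^2
```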

Figure 2: Traversing a sphere using a 1-dimensional manifold, where $\phi$ is a multi-layer perceptron with either LeakyReLU (left) or Sinusoidal activations (right), trained using SWGAN. Here, $L$ represents the bound on the input to the network.

3.2 Reparameterization and Manifold Constrained Optimization

Having trained a generator model, $\phi: [-L, L]^k \to \mathbb{S}^{d-1}$, we freeze it and use it to reparameterize $d$-dimensional vectors as $\theta = \beta \phi(\alpha)$, where $\beta \in \mathbb{R}$ and $\alpha \in [-L, L]^k$, thereby reducing the number of parameters from $d$ to $k + 1$. Figure 1 (Panel (a)) illustrates this concept. Furthermore, we can enhance the model’s flexibility by using $m$ different pairs of $(\alpha, \beta)$, constraining $\theta$ to the span of $\{\phi(\alpha_i)\}_{i=1}^{m}$. This will lead to a compression ratio of $\frac{d}{m(k+1)}$ (see Figure 1 Panels (b) and (d)).
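As a quick worked example of this ratio, the snippet below evaluates $\frac{d}{m(k+1)}$ for $d = 1600$ and $k = 16$ (the ResNet-20 setting used in the CIFAR experiments). The values of $m$ are assumptions chosen only to show how the ratio scales.

```python
def compression_ratio(d, k, m=1):
    """Ratio of original to stored parameters for one chunk: d / (m * (k + 1))."""
    return d / (m * (k + 1))

print(round(compression_ratio(d=1600, k=16, m=1), 1))   # 94.1
print(round(compression_ratio(d=1600, k=16, m=2), 1))   # 47.1
```

Doubling $m$ halves the compression ratio, which is why $m$ can serve as a knob for the compression rate without retraining the generator.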

Let $\mathcal{L}: \mathbb{R}^d \to \mathbb{R}$ denote the loss function for a specific task of interest, where $\mathcal{L}(\theta)$ is the associated loss with parameter $\theta$. We then constrain the training of the parameter $\theta$ as:

$$\theta^{*} = \sum_{i=1}^{m} \beta^{*}_{i}\, \phi(\alpha^{*}_{i}), \qquad \{(\alpha^{*}_{i}, \beta^{*}_{i})\}_{i=1}^{m} = \operatorname*{arg\,min}_{\{(\alpha_{i}, \beta_{i})\}_{i=1}^{m}} \mathcal{L}\Big(\sum_{i=1}^{m} \beta_{i}\, \phi(\alpha_{i})\Big) \qquad (1)$$

When $m = 1$, this optimization constrains the direction of the model parameters, $\theta$, to the $k$-dimensional manifold on $\mathbb{S}^{d-1}$ that is parameterized by $\phi$. One of the main advantages of using $m > 1$ pairs of $(\alpha, \beta)$ is that we can easily change the compression rate without retraining a generator. Apart from this flexibility, our ablation studies show that, for a fixed compression rate, $m = 1$ and $m > 1$ result in similar performance.
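A toy instance of Eq. (1) can be sketched as follows for $d = 3$, $k = 1$, and $m = 2$. Everything specific here is an assumption made for illustration: the frozen generator is a closed-form spiral instead of a trained network, the loss is a squared distance to a fixed target vector, gradients come from central finite differences instead of automatic differentiation, the step-size rule is a simple accept-or-halve scheme, and $\alpha$ is clipped to $[-L, L]$.

```python
import numpy as np

def phi(alpha, L=1.0, turns=5):
    """Hypothetical frozen generator: a closed-form spiral on S^2 (k = 1, d = 3)."""
    z = np.clip(alpha / L, -1.0, 1.0)
    r = np.sqrt(1.0 - z**2)
    a = 2.0 * np.pi * turns * z
    return np.array([r * np.cos(a), r * np.sin(a), z])

def reconstruct(params):
    """theta = sum_i beta_i * phi(alpha_i); params has shape (m, 2) with rows [alpha_i, beta_i]."""
    return sum(beta * phi(alpha) for alpha, beta in params)

def loss(params, target):
    """Toy task loss: squared distance between the reconstructed weights and a target."""
    return float(np.sum((reconstruct(params) - target) ** 2))

def num_grad(params, target, eps=1e-6):
    """Central finite differences, standing in for automatic differentiation."""
    grad = np.zeros_like(params)
    for idx in np.ndindex(params.shape):
        step = np.zeros_like(params)
        step[idx] = eps
        grad[idx] = (loss(params + step, target) - loss(params - step, target)) / (2 * eps)
    return grad

target = np.array([0.5, -1.0, 0.8])
params = np.array([[0.1, 1.0], [-0.4, 1.0]])     # m = 2 pairs (alpha_i, beta_i)
initial = loss(params, target)
lr = 0.05
for _ in range(500):
    candidate = params - lr * num_grad(params, target)
    candidate[:, 0] = np.clip(candidate[:, 0], -1.0, 1.0)   # keep alpha inside [-L, L]
    if loss(candidate, target) < loss(params, target):
        params = candidate        # accept the step
    else:
        lr *= 0.5                 # otherwise shrink the step size
final = loss(params, target)
```

Only the four numbers in `params` are updated; the generator stays fixed throughout, and the full weight vector is recovered at any time with `reconstruct(params)`.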

3.3 Learning the Deep Model

We train the generator $\phi(\cdot)$ once using uniformly sampled random data and freeze it. Then, given a deep model, we reshape its parameters into a long vector, divide that vector into chunks of size $d$, and reparameterize each chunk with $m(k+1)$ parameters using the generator $\phi(\cdot)$. If the model size is not divisible by $d$, the last chunk produces some extra parameters, which are ignored. Finally, we train the model by optimizing $\{(\alpha_i, \beta_i)\}_{i=1}^{m}$ for all chunks using Eq. 1 to minimize the loss of the deep model on the task of interest, e.g., image classification. Backpropagation is as simple as calculating the gradient for the model parameters and then using the chain rule to backpropagate through the generator $\phi(\cdot)$. Hence, automatic differentiation can be directly used to optimize the $\alpha$s and $\beta$s without the need for geometric optimization techniques.
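The chunking and reconstruction step described above can be sketched as follows. The helper names, the toy generator `toy_phi` (normalized cosine features of a fixed random projection), and the layer shapes are all hypothetical; a real setup would use the trained, frozen generator.

```python
import numpy as np

def num_chunks(n_params, d):
    """Number of size-d chunks needed to cover n_params weights (ceiling division)."""
    return -(-n_params // d)

def reconstruct_flat(alphas, betas, phi, n_params):
    """Rebuild a flat weight vector from per-chunk (alpha, beta) pairs.

    alphas: (n_chunks, m, k), betas: (n_chunks, m); phi maps (N, k) -> (N, d).
    """
    n_chunks, m, k = alphas.shape
    dirs = phi(alphas.reshape(-1, k)).reshape(n_chunks, m, -1)   # (n_chunks, m, d)
    chunks = np.einsum("cm,cmd->cd", betas, dirs)                # sum_i beta_i * phi(alpha_i)
    return chunks.reshape(-1)[:n_params]                         # ignore surplus outputs of the last chunk

rng = np.random.default_rng(0)
k, d, m = 2, 10, 1
W = rng.standard_normal((d, k))

def toy_phi(a):
    """Hypothetical frozen generator used only for this illustration."""
    out = np.cos(a @ W.T)
    return out / np.linalg.norm(out, axis=1, keepdims=True)

shapes = [(3, 7), (7,)]                               # e.g. one weight matrix and one vector
n_params = sum(int(np.prod(s)) for s in shapes)       # 28 weights in total
c = num_chunks(n_params, d)                           # 3 chunks of size 10
alphas = rng.uniform(-1.0, 1.0, size=(c, m, k))
betas = rng.standard_normal((c, m))
flat = reconstruct_flat(alphas, betas, toy_phi, n_params)

# Cut the flat vector back into the original tensor shapes.
tensors, offset = [], 0
for s in shapes:
    size = int(np.prod(s))
    tensors.append(flat[offset:offset + size].reshape(s))
    offset += size
```

In this toy setting, 28 weights are described by `alphas.size + betas.size = 9` trainable numbers, matching the count of $m(k+1)$ parameters per chunk.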

4 Experiments

To demonstrate the performance of our method, we evaluate MCNC under two different settings:

  1. Training from scratch for image classification. In this setting, we initialize the weights of the network directly using MCNC and optimize the associated $\alpha$s and $\beta$s. We show our effectiveness at compressing a wide variety of ResNet architectures as well as ViT.

  2. Parameter Efficient Fine-Tuning of LLMs. In this setting, rather than initializing the weights of the network directly by MCNC, we begin with a pre-trained $\theta^{*}$ and learn the residual $\Delta\theta$ via MCNC. We conduct this experiment for LLMs, where fine-tuning of large models has become the norm.

Given that our method is orthogonal to the low-rank parameterization used in LoRA [11], we reparameterize either the original networks or their rank-constrained versions. Our experiments explore both scenarios.
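To make the combination with low-rank parameterization concrete, the sketch below counts how many values MCNC would have to generate for a single weight matrix, with and without a rank-$r$ factorization, and the resulting number of trainable values. The layer shape and the helper names are hypothetical, and the per-chunk cost of $m(k+1)$ assumes the chunking scheme of Section 3.3 applied to that layer alone.

```python
import math


def generated_values(out_dim, in_dim, rank=None):
    """Values to generate: full matrix, or two rank-r factors (out x r and r x in)."""
    if rank is None:
        return out_dim * in_dim
    return rank * (out_dim + in_dim)


def trainable_params(num_values, d, k, m=1):
    """Trainable (alpha, beta) entries when num_values is split into size-d chunks."""
    return math.ceil(num_values / d) * m * (k + 1)


# Hypothetical 512 x 512 layer with k=16, d=1600.
full = generated_values(512, 512)
low_rank = generated_values(512, 512, rank=64)
print(full, trainable_params(full, d=1600, k=16))
print(low_rank, trainable_params(low_rank, d=1600, k=16))
```

Because the rank-constrained layer has fewer values to generate, the same parameter budget can be spent on a larger $m$ or on a less aggressive generator.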

4.1 CIFAR-10 and CIFAR-100

We compare MCNC on CIFAR-10 and CIFAR-100 [62] with pruning methods as well as PRANC [13] and NOLA [15]. We show results on the two datasets using both ResNet-20 and ResNet-56. For the pruning methods and PRANC, we directly compare with the results reported in [13]. We follow the experimental setting of PRANC [13], use the same data augmentation, and do not compress BatchNorm parameters. For MCNC and our implementation of NOLA, we train for $400$ epochs, decreasing the learning rate whenever the loss does not decrease for $5$ epochs to ensure convergence. We use the Adam [63] optimizer with an initial learning rate of $0.001$ for NOLA and $0.05$ for MCNC. For NOLA and MCNC with low-rank parameterization (denoted as w/ LoRA), we follow [15] and use a rank of $64$. We emphasize that we did not try different ranks, and the performance could be further enhanced by choosing the optimal rank for these tasks. For MCNC we let $k=16, d=1600, L=20$ when applied to ResNet-20 and $k=16, d=3200, L=20$ when applied to ResNet-56. We show results in Table 1. MCNC matches or outperforms all other compression methods while using a similar number of parameters, and with LoRA it is the best in every setting. In addition, just as NOLA improves upon PRANC with the addition of LoRA, MCNC gains a significant boost as well.
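The learning-rate rule used for MCNC and NOLA can be sketched as a plateau-based schedule. The snippet below is a minimal pure-Python illustration. The decay factor of 0.5 and the class name `PlateauDecay` are assumptions, since the exact factor is not specified here.

```python
class PlateauDecay:
    """Reduce the learning rate when a monitored value stops decreasing."""

    def __init__(self, lr, patience=5, factor=0.5):
        self.lr = lr
        self.patience = patience
        self.factor = factor  # assumed decay factor
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, value):
        """Update with this epoch's monitored value and return the current lr."""
        if value < self.best:
            self.best = value
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr *= self.factor
                self.stale_epochs = 0
        return self.lr


schedule = PlateauDecay(lr=0.05, patience=5)
losses = [1.0, 0.9] + [0.95] * 5 + [0.8]
history = [schedule.step(v) for v in losses]
print(history)
```

Here the learning rate stays at its initial value until five consecutive epochs pass without improvement, and is then halved once.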

Table 1: Comparison of our method with pruning methods as well as PRANC and NOLA. We include results for our method with and without LoRA applied to each of the layers. As none of the methods compress BatchNorm parameters, we remove them from all parameter counts. The number of BatchNorm parameters is 1,376 for ResNet-20 and 4,064 for ResNet-56. For both NOLA and our method with LoRA, we use a rank of $64$.
| Method | Arch. | Dataset | # Parameters | Acc. |
|---|---|---|---|---|
| Baseline | R20 | CIFAR-10 | 269,722 | 88.92 |
| DPF [33] | R20 | CIFAR-10 | 4,920 × 2 | 41.86 [13] |
| SuRP [40] | R20 | CIFAR-10 | 1,026 × 2 | 54.22 [13] |
| STR [64] | R20 | CIFAR-10 | 12,238 × 2 | 75.99 [13] |
| PRANC [13] | R20 | CIFAR-10 | 10,000 | 81.48 [13] |
| NOLA [15] | R20 | CIFAR-10 | 11,500 | 82.4 [15] |
| Ours w/o LoRA | R20 | CIFAR-10 | 10,098 | 82.35 ± 0.4 |
| Ours w/ LoRA | R20 | CIFAR-10 | **10,098** | **83.76 ± 0.1** |
| Baseline | R56 | CIFAR-10 | 853,018 | 91.64 |
| DPF [33] | R56 | CIFAR-10 | 13,414 × 2 | 47.66 [13] |
| SuRP [40] | R56 | CIFAR-10 | 10,834 × 2 | 66.65 [13] |
| STR [64] | R56 | CIFAR-10 | 13,312 × 2 | 67.77 [13] |
| PRANC [13] | R56 | CIFAR-10 | 5,000 | 76.87 [13] |
| NOLA [15] | R56 | CIFAR-10 | 5,000 | 78.51 ± 0.1 |
| Ours w/o LoRA | R56 | CIFAR-10 | 5,015 | 79.16 ± 0.2 |
| Ours w/ LoRA | R56 | CIFAR-10 | **5,015** | **80.38 ± 0.3** |
| Baseline | R20 | CIFAR-100 | 275,572 | 60.84 |
| DPF [33] | R20 | CIFAR-100 | 10,770 × 2 | 12.25 [13] |
| SuRP [40] | R20 | CIFAR-100 | 6,797 × 2 | 14.46 [13] |
| STR [64] | R20 | CIFAR-100 | 10,673 × 2 | 13.18 [13] |
| PRANC [13] | R20 | CIFAR-100 | 5,000 | 32.33 [13] |
| NOLA [15] | R20 | CIFAR-100 | 5,180 | 35.62 ± 0.1 |
| Ours w/o LoRA | R20 | CIFAR-100 | 5,066 | 36.10 ± 0.6 |
| Ours w/ LoRA | R20 | CIFAR-100 | **5,066** | **39.52 ± 0.6** |
| Baseline | R56 | CIFAR-100 | 858,868 | 64.32 |
| DPF [33] | R56 | CIFAR-100 | 19,264 × 2 | 19.11 [13] |
| SuRP [40] | R56 | CIFAR-100 | 10,919 × 2 | 14.59 [13] |
| STR [64] | R56 | CIFAR-100 | 18,881 × 2 | 25.98 [13] |
| PRANC [13] | R56 | CIFAR-100 | 5,000 | 32.97 [13] |
| NOLA [15] | R56 | CIFAR-100 | 5,000 | 36.21 ± 0.6 |
| Ours w/o LoRA | R56 | CIFAR-100 | 5,049 | 36.68 ± 0.5 |
| Ours w/ LoRA | R56 | CIFAR-100 | **5,015** | **39.05 ± 0.3** |

4.2 ImageNet-100

To further demonstrate MCNC's effectiveness, we expanded our experiments to include the more challenging ImageNet-100 dataset [65]. We evaluated our method on larger architectures, such as ResNet-18, and extended our analysis to Vision Transformers by testing on DeiT-Ti [66], showing that our approach remains effective across these models. For all MCNC experiments, we let $k=16, d=1600, L=20$ and vary $m$ to achieve the desired number of parameters. Table 2 shows the results on ImageNet-100 with ResNet-18. In this experiment, we also varied the number of parameters (i.e., the compression rate) for our method to provide a more holistic comparison. At a similar compression rate (# parameters $\approx 100{,}000$), MCNC reaches 70.09% accuracy, compared with 61.08% for PRANC and 64.66% for NOLA. Similarly, Table 3 shows the results of MCNC on ImageNet-100 for the DeiT-Ti model.
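As a quick consistency check on how $m$ sets the parameter budget, note that each $(\alpha,\beta)$ pair contributes $k+1$ trainable values, so with $k=16$ every budget should be a multiple of 17. The sketch below verifies this for the parameter counts reported for our method in Table 2. The variable names are illustrative only.

```python
k = 16
per_pair = k + 1  # one alpha in R^k plus one scalar beta

# Parameter counts reported for our method in Table 2 (ResNet-18, ImageNet-100).
reported = [50_150, 100_300, 200_175, 400_350, 800_275, 1_600_125]

pair_counts = []
for count in reported:
    pairs, remainder = divmod(count, per_pair)
    pair_counts.append(pairs)
    print(f"{count:>9,d} parameters = {pairs:,d} pairs x {per_pair} (remainder {remainder})")
```

Every reported count divides evenly by 17, which is consistent with the budget being set purely by the total number of $(\alpha,\beta)$ pairs across all chunks.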

Table 2: Comparison with PRANC and NOLA on ImageNet-100 using ResNet-18 with multiple compression rates. All results reported for our method use LoRA. The 9,600 BatchNorm parameters are excluded from all parameter counts. NOLA and our method with LoRA use rank $64$. The accuracy saturation for a higher number of parameters could be due to training with a fixed number of epochs.

| Method | # Parameters | Acc. |
|---|---|---|
| Baseline | 11,218,212 | 82.1 |
| PRANC [13] | 100,000 | 61.08 [13] |
| PRANC [13] | 200,000 | 67.28 [13] |
| LoRA [11] | 140,400 | 63.50 [15] |
| NOLA [15] | 100,000 | 64.66 [15] |
| Ours | 50,150 | 63.52 ± 0.4 |
| Ours | 100,300 | 70.09 ± 0.5 |
| Ours | 200,175 | 75.03 ± 0.2 |
| Ours | 400,350 | 77.02 ± 0.5 |
| Ours | 800,275 | 77.76 ± 0.5 |
| Ours | 1,600,125 | 77.83 ± 0.7 |

Table 3: Comparison of our method with LoRA on DeiT-Ti. We include results for our method both with and without LoRA. We use rank $32$ for our method when applicable.

| Method | # Parameters | Acc. |
|---|---|---|
| Baseline | 5,543,716 | 72.08 |
| LoRA (rank=32) | 2,855,076 | 64.96 |
| LoRA (rank=1) | 155,720 | 44.79 |
| Ours w/o LoRA | 106,538 | 47.45 ± 0.0 |
| Ours w/o LoRA | 165,460 | 54.20 ± 0.6 |
| Ours w/ LoRA | 106,992 | 52.21 ± 0.4 |
| Ours w/ LoRA | 166,560 | 56.12 ± 0.2 |

Table 4: Instruction finetuning for quantized LLaMA-2: We fine-tune LLaMA-2 with the Alpaca dataset and compare MCNC to NOLA, maintaining the same number of parameters for both. "Generation GFLOPs" refers to the FLOPs needed to generate the original parameters of the adapter on the fly. MCNC achieves comparable performance to NOLA while requiring fewer FLOPs for on-the-fly parameter generation. This efficiency results in faster inference and training for MCNC compared to NOLA. As shown in the table, MCNC achieves a $23\%$ increase in throughput for LLaMA 7B and a $35\%$ increase for LLaMA 13B. We use a single RTX 3090 GPU to calculate the throughput.

| Model | Method | Trainable Parameters | MMLU Acc | Train Loss | Val Loss | Throughput (Samples/Sec) | Model Reconstruction GFLOPs |
|---|---|---|---|---|---|---|---|
| LLaMA-2 - 7B (4-bit) | NOLA [15] | 28k | 45.5 | 1.08 | 1.01 | 2.6 | 11.19 |
| LLaMA-2 - 7B (4-bit) | MCNC | 25k | 45.8 | 1.07 | 1.01 | 3.2 (↑23%) | 2.75 |
| LLaMA-2 - 13B (4-bit) | NOLA [15] | 78k | 54.8 | 1.00 | 0.96 | 1.7 | 17.52 |
| LLaMA-2 - 13B (4-bit) | MCNC | 77k | 55.0 | 1.00 | 0.95 | 2.3 (↑35%) | 4.22 |

Table 5: The impact of MCNC design decisions on ImageNet-100 classification accuracy using a ResNet-18 with a $100$k parameter budget.

| Manifold | Low Rank | $m>1$ | Acc. |
|---|---|---|---|
| × | × | ✓ | 61.08 [13] |
| × | ✓ | ✓ | 64.66 [15] |
| ✓ | × | × | 63.06 ± 0.3 |
| ✓ | × | ✓ | 61.53 ± 0.2 |
| ✓ | ✓ | × | **70.18 ± 0.4** |
| ✓ | ✓ | ✓ | 70.09 ± 0.5 |

Table 6: Impact of varying $k$ and $d$ with fixed compression rate on ImageNet-100 classification using a ResNet-18 with a $100$k parameter budget.

| Input Size ($k$) | Output Size ($d$) | Acc. |
|---|---|---|
| 1 | 100 | 30.35 ± 0.9 |
| 2 | 200 | 48.97 ± 0.2 |
| 4 | 400 | 68.18 ± 0.3 |
| 8 | 800 | 69.99 ± 0.6 |
| **16** | **1600** | **70.08 ± 0.5** |
| 32 | 3200 | 68.04 ± 0.3 |

Table 7: The effect of increasing $L$ on the accuracy of ImageNet-100.

| $L$ | Acc. |
|---|---|
| 1 | 56.77 ± 1.4 |
| 5 | 66.23 ± 0.6 |
| 10 | 69.68 ± 0.5 |
| 20 | 69.65 ± 0.5 |
| **40** | **69.85 ± 0.5** |
| 50 | 58.91 ± 0.5 |

Figure 3: The effect of increasing $k$ on the classification accuracy of CIFAR-100.

4.3 MCNC for Fine-tuning Large Language Models

In this section, we compare the performance of MCNC and NOLA in parameter-efficient fine-tuning. We conduct fine-tuning experiments on two variants of the LLaMA-2 [4] language model, LLaMA 7B and LLaMA 13B, using the Alpaca dataset [67] as our training data. We report both training loss and validation loss on the Alpaca dataset. Additionally, we report 5-shot accuracy on MMLU (Massive Multitask Language Understanding) [68], following standard practice. This benchmark spans 57 tasks across diverse fields such as computer science, mathematics, law, and history.

Implementation Details: We use the training code from QLoRA [69] and NOLA [15] for our experiments. We quantize the original parameters of the language model to 4-bit, then apply and fine-tune the adapter on all layers of the transformer. For NOLA, we follow the hyperparameters reported in [15]. We set the rank to $16$ for both NOLA and our method. In our method, we use a generator with the following specifications: a 3-layer MLP with an input dimension of 5, an output dimension of 5000, and a hidden dimension of 32. We use $m=1$ for our method. In NOLA, we use 64 bases for A and B in LLaMA 7B and 140 bases for LLaMA 13B. These numbers of bases result in roughly the same number of optimized parameters as our method for each architecture (see Table 4). All other hyperparameters are identical for both methods except the learning rate; we use $lr=0.001$ for NOLA and $0.05$ for ours.

Results: The results are presented in Table 4. MCNC has comparable performance to NOLA with a similar number of parameters. Importantly, however, MCNC requires about $75\%$ fewer GFLOPs than NOLA for generating parameters on the fly. This efficiency translates into faster inference and training for MCNC. As shown in Table 4, MCNC achieves $23\%$ higher throughput for LLaMA 7B and $35\%$ higher throughput for LLaMA 13B. While it is feasible to generate parameters offline and merge them with the original weights of the language model, this approach limits the applications when processing multiple tasks and their corresponding adapters in a batch. Therefore, in scenarios involving batch processing of tasks, MCNC holds an advantage over NOLA due to its higher throughput.
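The relative figures quoted above follow directly from the throughput and GFLOPs columns of Table 4. The short computation below reproduces them. The dictionary layout is for illustration only.

```python
# (throughput in samples/sec, reconstruction GFLOPs) taken from Table 4.
table4 = {
    "LLaMA-2 7B": {"NOLA": (2.6, 11.19), "MCNC": (3.2, 2.75)},
    "LLaMA-2 13B": {"NOLA": (1.7, 17.52), "MCNC": (2.3, 4.22)},
}

results = {}
for model, rows in table4.items():
    nola_tput, nola_gflops = rows["NOLA"]
    mcnc_tput, mcnc_gflops = rows["MCNC"]
    speedup = 100.0 * (mcnc_tput / nola_tput - 1.0)
    flop_saving = 100.0 * (1.0 - mcnc_gflops / nola_gflops)
    results[model] = (round(speedup), round(flop_saving))
    print(f"{model}: +{speedup:.0f}% throughput, {flop_saving:.0f}% fewer GFLOPs")
```

The GFLOPs reduction comes out at roughly 75% for the 7B model and roughly 76% for the 13B model.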

4.4 Ablation Studies

Impact of MCNC design decisions: When applying MCNC to a model with a specific parameter budget, we can proceed in several ways. First, we can design a generator with $k$ and $d$ values that yield the exact number of parameters needed for model reparameterization. Alternatively, we can increase the compression rate of the generator and adjust $m$ accordingly. Another option is to perform a low-rank decomposition of the model before compression, which allows for lower compression rates or higher $m$ values while maintaining the same parameter count. We explore the effects of these design choices by assessing the compression performance of ResNet-18 on the ImageNet-100 dataset with a parameter budget of approximately 100,000, selecting $k$, $d$, and $m$ to meet this target. Without our nonlinear manifold, this equates to using random bases, specifically NOLA and PRANC, with and without low-rank constraints, respectively. Table 5 shows that integrating the nonlinear manifold enhances performance across the settings considered. Moreover, increasing $m$ beyond 1 does not significantly affect performance, allowing MCNC to achieve diverse compression rates using a single generator without compromising performance. The most significant performance improvement comes from low-rank decomposition.

Impact of varying $k$ and $d$ with fixed compression rate: The ratio of $k$ to $d$ determines the generator's compression rate; however, selecting appropriate values for $k$ and $d$ under a fixed compression rate presents challenges. To investigate this, we trained multiple generators, each with a compression rate of 100, and applied them to ImageNet-100 using ResNet-18, with $L$ fixed at 10. The results are presented in Table 6. A $k$ value close to 1 leads to poor performance, possibly due to reduced degrees of freedom when traversing the manifold, which complicates optimization. Increasing $k$ and $d$ generally enhances performance until it saturates and then begins to decline. This decline is likely because, as $d$ becomes very large, the generator struggles to effectively cover the space.
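For reference, the generator configurations compared in Table 6 all share the same ratio $d/k = 100$, even though the input and output sizes differ. A small sketch makes this explicit (the variable names are illustrative):

```python
# (k, d) generator configurations from Table 6.
configs = [(1, 100), (2, 200), (4, 400), (8, 800), (16, 1600), (32, 3200)]

rates = []
for k, d in configs:
    rate = d / k  # generator compression rate as the ratio of output to input size
    rates.append(rate)
    print(f"k={k:>2}, d={d:>4}: compression rate {rate:.0f}")
```

Only the shape of the manifold changes between rows, not the nominal compression rate, which is what isolates the effect of $k$ and $d$ in this ablation.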

Effect of choice of $L$ when training downstream tasks: As noted in Section 3.1, the choice of $L$ controls the frequency of twists in the manifold. In addition to impacting the coverage of the high-dimensional space, this choice can directly impact the ease of optimization when weights are constrained to live inside this manifold. For this reason, we train multiple generators with varying values of $L$ and use each to train a ResNet-20 on CIFAR-100 while fixing $k=5, d=500$. We show our results in Table 7. When $L$ becomes too small or too large, MCNC's accuracy drops significantly. When $L$ is small, this is likely due to insufficient coverage of the high-dimensional space. When $L$ is large, however, the number of turns in the manifold increases, likely leading to difficulties in optimization that result in decreased performance.

Effect of increasing $k$ to decrease compression rate: MCNC's compression is determined by both the generator compression and the selection of $m$. We wish to study how the performance improves as we decrease the compression rate of the generator. To do so, we fix $m=1, d=1600$ and vary $k$. We then use each generator to train ResNet-20 with low-rank decomposition on CIFAR-100 and show the results in Figure 3. Performance improves steadily as we decrease the compression rate of the generator. Notably, we almost entirely recover the original performance of the model while using a fraction of the total number of parameters.

5 Conclusion and Limitations

Leveraging the prevalence of viable solutions in training overparameterized deep models, we have developed a nonlinear reparameterization method aimed at exploring local minima within a low-dimensional manifold of the parameter space. This approach utilizes a generator model that maps a low-dimensional uniform distribution to a high-dimensional uniform distribution, effectively wrapping the low-dimensional input subspace around the high-dimensional hypersphere. Through extensive experiments in computer vision and NLP tasks, we demonstrate that our method, MCNC, significantly outperforms state-of-the-art baselines in terms of accuracy and/or model reconstruction time. We believe this technique can enhance the storage and deployment of large models, which are integral to the recent advancements in AI, thus facilitating more democratized access to AI tools. Such access may also result in negative impacts by putting AI in the hands of many non-sophisticated adversaries.

Limitations: We have not fully explored the optimal generator model configuration, including its architecture, loss function, and types of regularizers, and their effects on the downstream optimization. Moreover, due to computational constraints, we have not shown the effect of MCNC in training large language models from scratch, which may be the most important impact of MCNC.

Acknowledgments

This work was partially supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR00112290115 and HR00112190135 and NSF awards 1845216 and 2339898.

References

  • Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=YicbFdNTTy.
  • Zhai et al. [2022] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12104–12113, 2022.
  • Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
  • Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • Zhang et al. [2023a] Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, et al. Google USM: Scaling automatic speech recognition beyond 100 languages. arXiv preprint arXiv:2303.01037, 2023a.
  • Baevski et al. [2020] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33:12449–12460, 2020.
  • Grechishnikova [2021] Daria Grechishnikova. Transformer neural network for protein-specific de novo drug generation as a machine translation problem. Scientific reports, 11(1):321, 2021.
  • Jumper et al. [2021] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589, 2021.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Ding et al. [2023] Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, et al. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence, 5(3):220–235, 2023.
  • Hu et al. [2022] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9.
  • Tang et al. [2024] Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, and Dacheng Tao. A survey on transformer compression. arXiv preprint arXiv:2402.05964, 2024.
  • Nooralinejad et al. [2023] Parsa Nooralinejad, Ali Abbasi, Soroush Abbasi Koohpayegani, Kossar Pourahmadi Meibodi, Rana Muhammad Shahroz Khan, Soheil Kolouri, and Hamed Pirsiavash. PRANC: Pseudo random networks for compacting deep models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17021–17031, 2023.
  • Girish et al. [2023] Sharath Girish, Kamal Gupta, Saurabh Singh, and Abhinav Shrivastava. Lilnetx: Lightweight networks with EXtreme model compression and structured sparsification. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=NVZvalzCLg.
  • Koohpayegani et al. [2024] Soroush Abbasi Koohpayegani, Navaneet K L, Parsa Nooralinejad, Soheil Kolouri, and Hamed Pirsiavash. NOLA: Compressing LoRA using linear combination of random basis. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=TjfXcDgvzk.
  • Dettmers et al. [2024] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36, 2024.
  • Liu et al. [2022] Chaoyue Liu, Libin Zhu, and Mikhail Belkin. Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. Applied and Computational Harmonic Analysis, 59:85–116, 2022.
  • Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
  • Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Dehghani et al. [2019] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HyzdRiR9Y7.
  • Nowlan and Hinton [1992] Steven J Nowlan and Geoffrey E Hinton. Simplifying neural networks by soft weight-sharing. Neural Computation, 4(4):473–493, 1992.
  • Ullrich et al. [2017] Karen Ullrich, Edward Meeds, and Max Welling. Soft weight-sharing for neural network compression. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=HJGwcKclx.
  • Chen et al. [2015] Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. In International conference on machine learning, pages 2285–2294. PMLR, 2015.
  • Gao et al. [2019] Shangqian Gao, Cheng Deng, and Heng Huang. Cross domain model compression by structurally weight sharing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8973–8982, 2019.
  • Plummer et al. [2022] Bryan A. Plummer, Nikoli Dryden, Julius Frost, Torsten Hoefler, and Kate Saenko. Neural parameter allocation search. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=srtIXtySfT4.
  • Shakerinava et al. [2024] Mehran Shakerinava, Motahareh MS Sohrabi, Siamak Ravanbakhsh, and Simon Lacoste-Julien. Weight-sharing regularization. In International Conference on Artificial Intelligence and Statistics, pages 4204–4212. PMLR, 2024.
  • Rastegari et al. [2016] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European conference on computer vision, pages 525–542. Springer, 2016.
  • Lee et al. [2021] Junghyup Lee, Dohyung Kim, and Bumsub Ham. Network quantization with element-wise gradient scaling. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6448–6457, 2021.
  • Xiao et al. [2023] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 38087–38099. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/xiao23c.html.
  • Shkolnik et al. [2020] Moran Shkolnik, Brian Chmiel, Ron Banner, Gil Shomron, Yury Nahshan, Alex Bronstein, and Uri Weiser. Robust quantization: One model to rule them all. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 5308–5317. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/3948ead63a9f2944218de038d8934305-Paper.pdf.
  • Yao et al. [2023] Zhewei Yao, Cheng Li, Xiaoxia Wu, Stephen Youn, and Yuxiong He. A comprehensive study on post-training quantization for large language models. arXiv preprint arXiv:2303.08302, 2023.
  • Kim et al. [2020] Woojeong Kim, Suhyun Kim, Mincheol Park, and Geunseok Jeon. Neuron merging: Compensating for pruned neurons. Advances in Neural Information Processing Systems, 33:585–595, 2020.
  • Lin et al. [2020] Tao Lin, Sebastian U. Stich, Luis Barba, Daniil Dmitriev, and Martin Jaggi. Dynamic model pruning with feedback. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SJem8lSFwB.
  • Siems et al. [2021] Julien Niklas Siems, Aaron Klein, Cedric Archambeau, and Maren Mahsereci. Dynamic pruning of a neural network via gradient signal-to-noise ratio. In 8th ICML Workshop on Automated Machine Learning (AutoML), 2021.
  • Tiwari et al. [2021] Rishabh Tiwari, Udbhav Bamba, Arnav Chavan, and Deepak K Gupta. Chipnet: Budget-aware pruning with heaviside continuous approximations. arXiv preprint arXiv:2102.07156, 2021.
  • Hayou et al. [2020] Soufiane Hayou, Jean-Francois Ton, Arnaud Doucet, and Yee Whye Teh. Robust pruning at initialization. arXiv preprint arXiv:2002.08797, 2020.
  • Wang et al. [2020a] Huan Wang, Can Qin, Yulun Zhang, and Yun Fu. Neural pruning via growing regularization. arXiv preprint arXiv:2012.09243, 2020a.
  • Li et al. [2021] Yuchao Li, Shaohui Lin, Jianzhuang Liu, Qixiang Ye, Mengdi Wang, Fei Chao, Fan Yang, Jincheng Ma, Qi Tian, and Rongrong Ji. Towards compact cnns via collaborative compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6438–6447, 2021.
  • Kusupati et al. [2020a] Aditya Kusupati, Vivek Ramanujan, Raghav Somani, Mitchell Wortsman, Prateek Jain, Sham Kakade, and Ali Farhadi. Soft threshold weight reparameterization for learnable sparsity. In Proceedings of the International Conference on Machine Learning, July 2020a.
  • Isik et al. [2022] Berivan Isik, Tsachy Weissman, and Albert No. An information-theoretic justification for model pruning. In International Conference on Artificial Intelligence and Statistics, pages 3821–3846. PMLR, 2022.
  • Han et al. [2015] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
  • Havasi et al. [2018] Marton Havasi, Robert Peharz, and José Miguel Hernández-Lobato. Minimal random code learning: Getting bits back from compressed model parameters. arXiv preprint arXiv:1810.00440, 2018.
  • Reagan et al. [2018] Brandon Reagan, Udit Gupta, Bob Adolf, Michael Mitzenmacher, Alexander Rush, Gu-Yeon Wei, and David Brooks. Weightless: Lossy weight encoding for deep neural network compression. In International Conference on Machine Learning, pages 4324–4333. PMLR, 2018.
  • Valipour et al. [2023] Mojtaba Valipour, Mehdi Rezagholizadeh, Ivan Kobyzev, and Ali Ghodsi. Dylora: Parameter-efficient tuning of pre-trained models using dynamic search-free low-rank adaptation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 3274–3287, 2023.
  • Zhang et al. [2023b] Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. In The Eleventh International Conference on Learning Representations, 2023b. URL https://openreview.net/forum?id=lq62uWRJjiY.
  • Li et al. [2018] Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ryup8-WCW.
  • Fei et al. [2023] Yanhong Fei, Xian Wei, Yingjie Liu, Zhengyu Li, and Mingsong Chen. A survey of geometric optimization for deep learning: From euclidean space to riemannian manifold. arXiv preprint arXiv:2302.08210, 2023.
  • Ozay and Okatani [2016] Mete Ozay and Takayuki Okatani. Optimization on submanifolds of convolution kernels in cnns. arXiv preprint arXiv:1610.07008, 2016.
  • Huang et al. [2018] Lei Huang, Xianglong Liu, Bo Lang, Adams Yu, Yongliang Wang, and Bo Li. Orthogonal weight normalization: Solution to optimization over multiple dependent stiefel manifolds in deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • Wang et al. [2020b] Jiayun Wang, Yubei Chen, Rudrasis Chakraborty, and Stella X Yu. Orthogonal convolutional neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11505–11515, 2020b.
  • Absil et al. [2008] P-A Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2008.
  • Boumal [2023] Nicolas Boumal. An introduction to optimization on smooth manifolds. Cambridge University Press, 2023.
  • Meghwanshi et al. [2018] Mayank Meghwanshi, Pratik Jawanpuria, Anoop Kunchukuttan, Hiroyuki Kasai, and Bamdev Mishra. Mctorch, a manifold optimization library for deep learning. arXiv preprint arXiv:1810.01811, 2018.
  • Kochurov et al. [2020] Max Kochurov, Rasul Karimov, and Serge Kozlukov. Geoopt: Riemannian optimization in pytorch. arXiv preprint arXiv:2005.02819, 2020.
  • Luchnikov et al. [2021] Ilia Luchnikov, Alexander Ryzhov, Sergey Filippov, and Henni Ouerdane. Qgopt: Riemannian optimization for quantum technologies. SciPost Physics, 10(3):079, 2021.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  • Arjovsky et al. [2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223. PMLR, 2017.
  • Deshpande et al. [2018] Ishan Deshpande, Ziyu Zhang, and Alexander G Schwing. Generative modeling using the sliced wasserstein distance. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3483–3491, 2018.
  • Kolouri et al. [2019] Soheil Kolouri, Kimia Nadjahi, Umut Simsekli, Roland Badeau, and Gustavo Rohde. Generalized sliced wasserstein distances. Advances in neural information processing systems, 32, 2019.
  • Sitzmann et al. [2020] Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. Advances in neural information processing systems, 33:7462–7473, 2020.
  • Tancik et al. [2020] Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. Advances in neural information processing systems, 33:7537–7547, 2020.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.
  • Kusupati et al. [2020b] Aditya Kusupati, Vivek Ramanujan, Raghav Somani, Mitchell Wortsman, Prateek Jain, Sham Kakade, and Ali Farhadi. Soft threshold weight reparameterization for learnable sparsity. In International Conference on Machine Learning, pages 5544–5555. PMLR, 2020b.
  • Tian et al. [2020] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI, pages 776–794, Berlin, Heidelberg, 2020. Springer-Verlag. ISBN 978-3-030-58620-1. doi: 10.1007/978-3-030-58621-8_45. URL https://doi.org/10.1007/978-3-030-58621-8_45.
  • Touvron et al. [2021] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 10347–10357. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/touvron21a.html.
  • Taori et al. [2023] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  • Hendrycks et al. [2021] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.
  • Dettmers et al. [2023] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms, 2023.