-
Subtractive Training for Music Stem Insertion using Latent Diffusion Models
Authors:
Ivan Villa-Renteria,
Mason L. Wang,
Zachary Shah,
Zhe Li,
Soohyun Kim,
Neelesh Ramachandran,
Mert Pilanci
Abstract:
We present Subtractive Training, a simple and novel method for synthesizing individual musical instrument stems given other instruments as context. This method pairs a dataset of complete music mixes with 1) a variant of the dataset lacking a specific stem, and 2) LLM-generated instructions describing how the missing stem should be reintroduced. We then fine-tune a pretrained text-to-audio diffusi…
▽ More
We present Subtractive Training, a simple and novel method for synthesizing individual musical instrument stems given other instruments as context. This method pairs a dataset of complete music mixes with 1) a variant of the dataset lacking a specific stem, and 2) LLM-generated instructions describing how the missing stem should be reintroduced. We then fine-tune a pretrained text-to-audio diffusion model to generate the missing instrument stem, guided by both the existing stems and the text instruction. Our results demonstrate Subtractive Training's efficacy in creating authentic drum stems that seamlessly blend with the existing tracks. We also show that we can use the text instruction to control the generation of the inserted stem in terms of rhythm, dynamics, and genre, allowing us to modify the style of a single instrument in a full song while kee** the remaining instruments the same. Lastly, we extend this technique to MIDI formats, successfully generating compatible bass, drum, and guitar parts for incomplete arrangements.
△ Less
Submitted 27 June, 2024;
originally announced June 2024.
-
Towards Signal Processing In Large Language Models
Authors:
Prateek Verma,
Mert Pilanci
Abstract:
This paper introduces the idea of applying signal processing inside a Large Language Model (LLM). With the recent explosion of generative AI, our work can help bridge two fields together, namely the field of signal processing and large language models. We draw parallels between classical Fourier-Transforms and Fourier Transform-like learnable time-frequency representations for every intermediate a…
▽ More
This paper introduces the idea of applying signal processing inside a Large Language Model (LLM). With the recent explosion of generative AI, our work can help bridge two fields together, namely the field of signal processing and large language models. We draw parallels between classical Fourier-Transforms and Fourier Transform-like learnable time-frequency representations for every intermediate activation signal of an LLM. Once we decompose every activation signal across tokens into a time-frequency representation, we learn how to filter and reconstruct them, with all components learned from scratch, to predict the next token given the previous context. We show that for GPT-like architectures, our work achieves faster convergence and significantly increases performance by adding a minuscule number of extra parameters when trained for the same epochs. We hope this work paves the way for algorithms exploring signal processing inside the signals found in neural architectures like LLMs and beyond.
△ Less
Submitted 10 June, 2024;
originally announced June 2024.
-
AdaPTwin: Low-Cost Adaptive Compression of Product Twins in Transformers
Authors:
Emil Biju,
Anirudh Sriram,
Mert Pilanci
Abstract:
While large transformer-based models have exhibited remarkable performance in speaker-independent speech recognition, their large size and computational requirements make them expensive or impractical to use in resource-constrained settings. In this work, we propose a low-rank adaptive compression technique called AdaPTwin that jointly compresses product-dependent pairs of weight matrices in the t…
▽ More
While large transformer-based models have exhibited remarkable performance in speaker-independent speech recognition, their large size and computational requirements make them expensive or impractical to use in resource-constrained settings. In this work, we propose a low-rank adaptive compression technique called AdaPTwin that jointly compresses product-dependent pairs of weight matrices in the transformer attention layer. Our approach can prioritize the compressed model's performance on a specific speaker while maintaining generalizability to new speakers and acoustic conditions. Notably, our technique requires only 8 hours of speech data for fine-tuning, which can be accomplished in under 20 minutes, making it highly cost-effective compared to other compression methods. We demonstrate the efficacy of our approach by compressing the Whisper and Distil-Whisper models by up to 45% while incurring less than a 2% increase in word error rate.
△ Less
Submitted 13 June, 2024;
originally announced June 2024.
-
Randomized Geometric Algebra Methods for Convex Neural Networks
Authors:
Yifei Wang,
Sungyoon Kim,
Paul Chu,
Indu Subramaniam,
Mert Pilanci
Abstract:
We introduce randomized algorithms to Clifford's Geometric Algebra, generalizing randomized linear algebra to hypercomplex vector spaces. This novel approach has many implications in machine learning, including training neural networks to global optimality via convex optimization. Additionally, we consider fine-tuning large language model (LLM) embeddings as a key application area, exploring the i…
▽ More
We introduce randomized algorithms to Clifford's Geometric Algebra, generalizing randomized linear algebra to hypercomplex vector spaces. This novel approach has many implications in machine learning, including training neural networks to global optimality via convex optimization. Additionally, we consider fine-tuning large language model (LLM) embeddings as a key application area, exploring the intersection of geometric algebra and modern AI techniques. In particular, we conduct a comparative analysis of the robustness of transfer learning via embeddings, such as OpenAI GPT models and BERT, using traditional methods versus our novel approach based on convex optimization. We test our convex optimization transfer learning method across a variety of case studies, employing different embeddings (GPT-4 and BERT embeddings) and different text classification datasets (IMDb, Amazon Polarity Dataset, and GLUE) with a range of hyperparameter settings. Our results demonstrate that convex optimization and geometric algebra not only enhances the performance of LLMs but also offers a more stable and reliable method of transfer learning via embeddings.
△ Less
Submitted 8 June, 2024; v1 submitted 4 June, 2024;
originally announced June 2024.
-
Compressing Large Language Models using Low Rank and Low Precision Decomposition
Authors:
Rajarshi Saha,
Naomi Sagan,
Varun Srivastava,
Andrea J. Goldsmith,
Mert Pilanci
Abstract:
The prohibitive sizes of Large Language Models (LLMs) today make it difficult to deploy them on memory-constrained edge devices. This work introduces $\rm CALDERA$ -- a new post-training LLM compression algorithm that harnesses the inherent low-rank structure of a weight matrix $\mathbf{W}$ by approximating it via a low-rank, low-precision decomposition as…
▽ More
The prohibitive sizes of Large Language Models (LLMs) today make it difficult to deploy them on memory-constrained edge devices. This work introduces $\rm CALDERA$ -- a new post-training LLM compression algorithm that harnesses the inherent low-rank structure of a weight matrix $\mathbf{W}$ by approximating it via a low-rank, low-precision decomposition as $\mathbf{W} \approx \mathbf{Q} + \mathbf{L}\mathbf{R}$. Here, $\mathbf{L}$ and $\mathbf{R}$ are low rank factors, and the entries of $\mathbf{Q}$, $\mathbf{L}$ and $\mathbf{R}$ are quantized. The model is compressed by substituting each layer with its $\mathbf{Q} + \mathbf{L}\mathbf{R}$ decomposition, and the zero-shot performance of the compressed model is evaluated. Additionally, $\mathbf{L}$ and $\mathbf{R}$ are readily amenable to low-rank adaptation, consequently enhancing the zero-shot performance. $\rm CALDERA$ obtains this decomposition by formulating it as an optimization problem $\min_{\mathbf{Q},\mathbf{L},\mathbf{R}}\lVert(\mathbf{Q} + \mathbf{L}\mathbf{R} - \mathbf{W})\mathbf{X}^\top\rVert_{\rm F}^2$, where $\mathbf{X}$ is the calibration data, and $\mathbf{Q}, \mathbf{L}, \mathbf{R}$ are constrained to be representable using low-precision formats. Theoretical upper bounds on the approximation error of $\rm CALDERA$ are established using a rank-constrained regression framework, and the tradeoff between compression ratio and model performance is studied by analyzing the impact of target rank and quantization bit budget. Results illustrate that compressing LlaMa-$2$ $7$B/$70$B and LlaMa-$3$ $8$B models obtained using $\rm CALDERA$ outperforms existing post-training LLM compression techniques in the regime of less than $2.5$ bits per parameter. The implementation is available at: \href{https://github.com/pilancilab/caldera}{https://github.com/pilancilab/caldera}.
△ Less
Submitted 29 May, 2024;
originally announced May 2024.
-
Adversarial Training of Two-Layer Polynomial and ReLU Activation Networks via Convex Optimization
Authors:
Daniel Kuelbs,
Sanjay Lall,
Mert Pilanci
Abstract:
Training neural networks which are robust to adversarial attacks remains an important problem in deep learning, especially as heavily overparameterized models are adopted in safety-critical settings. Drawing from recent work which reformulates the training problems for two-layer ReLU and polynomial activation networks as convex programs, we devise a convex semidefinite program (SDP) for adversaria…
▽ More
Training neural networks which are robust to adversarial attacks remains an important problem in deep learning, especially as heavily overparameterized models are adopted in safety-critical settings. Drawing from recent work which reformulates the training problems for two-layer ReLU and polynomial activation networks as convex programs, we devise a convex semidefinite program (SDP) for adversarial training of polynomial activation networks via the S-procedure. We also derive a convex SDP to compute the minimum distance from a correctly classified example to the decision boundary of a polynomial activation network. Adversarial training for two-layer ReLU activation networks has been explored in the literature, but, in contrast to prior work, we present a scalable approach which is compatible with standard machine libraries and GPU acceleration. The adversarial training SDP for polynomial activation networks leads to large increases in robust test accuracy against $\ell^\infty$ attacks on the Breast Cancer Wisconsin dataset from the UCI Machine Learning Repository. For two-layer ReLU networks, we leverage our scalable implementation to retrain the final two fully connected layers of a Pre-Activation ResNet-18 model on the CIFAR-10 dataset. Our 'robustified' model achieves higher clean and robust test accuracies than the same architecture trained with sharpness-aware minimization.
△ Less
Submitted 22 May, 2024;
originally announced May 2024.
-
Spectral Adapter: Fine-Tuning in Spectral Space
Authors:
Fangzhao Zhang,
Mert Pilanci
Abstract:
Recent developments in Parameter-Efficient Fine-Tuning (PEFT) methods for pretrained deep neural networks have captured widespread interest. In this work, we study the enhancement of current PEFT methods by incorporating the spectral information of pretrained weight matrices into the fine-tuning procedure. We investigate two spectral adaptation mechanisms, namely additive tuning and orthogonal rot…
▽ More
Recent developments in Parameter-Efficient Fine-Tuning (PEFT) methods for pretrained deep neural networks have captured widespread interest. In this work, we study the enhancement of current PEFT methods by incorporating the spectral information of pretrained weight matrices into the fine-tuning procedure. We investigate two spectral adaptation mechanisms, namely additive tuning and orthogonal rotation of the top singular vectors, both are done via first carrying out Singular Value Decomposition (SVD) of pretrained weights and then fine-tuning the top spectral space. We provide a theoretical analysis of spectral fine-tuning and show that our approach improves the rank capacity of low-rank adapters given a fixed trainable parameter budget. We show through extensive experiments that the proposed fine-tuning model enables better parameter efficiency and tuning performance as well as benefits multi-adapter fusion. The code will be open-sourced for reproducibility.
△ Less
Submitted 22 May, 2024;
originally announced May 2024.
-
Faster Convergence of Stochastic Accelerated Gradient Descent under Interpolation
Authors:
Aaron Mishkin,
Mert Pilanci,
Mark Schmidt
Abstract:
We prove new convergence rates for a generalized version of stochastic Nesterov acceleration under interpolation conditions. Unlike previous analyses, our approach accelerates any stochastic gradient method which makes sufficient progress in expectation. The proof, which proceeds using the estimating sequences framework, applies to both convex and strongly convex functions and is easily specialize…
▽ More
We prove new convergence rates for a generalized version of stochastic Nesterov acceleration under interpolation conditions. Unlike previous analyses, our approach accelerates any stochastic gradient method which makes sufficient progress in expectation. The proof, which proceeds using the estimating sequences framework, applies to both convex and strongly convex functions and is easily specialized to accelerated SGD under the strong growth condition. In this special case, our analysis reduces the dependence on the strong growth constant from $ρ$ to $\sqrtρ$ as compared to prior work. This improvement is comparable to a square-root of the condition number in the worst case and address criticism that guarantees for stochastic acceleration could be worse than those for SGD.
△ Less
Submitted 2 April, 2024;
originally announced April 2024.
-
A Library of Mirrors: Deep Neural Nets in Low Dimensions are Convex Lasso Models with Reflection Features
Authors:
Emi Zeger,
Yifei Wang,
Aaron Mishkin,
Tolga Ergen,
Emmanuel Candès,
Mert Pilanci
Abstract:
We prove that training neural networks on 1-D data is equivalent to solving a convex Lasso problem with a fixed, explicitly defined dictionary matrix of features. The specific dictionary depends on the activation and depth. We consider 2-layer networks with piecewise linear activations, deep narrow ReLU networks with up to 4 layers, and rectangular and tree networks with sign activation and arbitr…
▽ More
We prove that training neural networks on 1-D data is equivalent to solving a convex Lasso problem with a fixed, explicitly defined dictionary matrix of features. The specific dictionary depends on the activation and depth. We consider 2-layer networks with piecewise linear activations, deep narrow ReLU networks with up to 4 layers, and rectangular and tree networks with sign activation and arbitrary depth. Interestingly in ReLU networks, a fourth layer creates features that represent reflections of training data about themselves. The Lasso representation sheds insight to globally optimal networks and the solution landscape.
△ Less
Submitted 18 March, 2024; v1 submitted 1 March, 2024;
originally announced March 2024.
-
Adaptive Inference: Theoretical Limits and Unexplored Opportunities
Authors:
Soheil Hor,
Ying Qian,
Mert Pilanci,
Amin Arbabian
Abstract:
This paper introduces the first theoretical framework for quantifying the efficiency and performance gain opportunity size of adaptive inference algorithms. We provide new approximate and exact bounds for the achievable efficiency and performance gains, supported by empirical evidence demonstrating the potential for 10-100x efficiency improvements in both Computer Vision and Natural Language Proce…
▽ More
This paper introduces the first theoretical framework for quantifying the efficiency and performance gain opportunity size of adaptive inference algorithms. We provide new approximate and exact bounds for the achievable efficiency and performance gains, supported by empirical evidence demonstrating the potential for 10-100x efficiency improvements in both Computer Vision and Natural Language Processing tasks without incurring any performance penalties. Additionally, we offer insights on improving achievable efficiency gains through the optimal selection and design of adaptive inference state spaces.
△ Less
Submitted 6 February, 2024;
originally announced February 2024.
-
Convex Relaxations of ReLU Neural Networks Approximate Global Optima in Polynomial Time
Authors:
Sungyoon Kim,
Mert Pilanci
Abstract:
In this paper, we study the optimality gap between two-layer ReLU networks regularized with weight decay and their convex relaxations. We show that when the training data is random, the relative optimality gap between the original problem and its relaxation can be bounded by a factor of O(log n^0.5), where n is the number of training samples. A simple application leads to a tractable polynomial-ti…
▽ More
In this paper, we study the optimality gap between two-layer ReLU networks regularized with weight decay and their convex relaxations. We show that when the training data is random, the relative optimality gap between the original problem and its relaxation can be bounded by a factor of O(log n^0.5), where n is the number of training samples. A simple application leads to a tractable polynomial-time algorithm that is guaranteed to solve the original non-convex problem up to a logarithmic factor. Moreover, under mild assumptions, we show that local gradient methods converge to a point with low training loss with high probability. Our result is an exponential improvement compared to existing results and sheds new light on understanding why local gradient methods work well.
△ Less
Submitted 5 June, 2024; v1 submitted 5 February, 2024;
originally announced February 2024.
-
Riemannian Preconditioned LoRA for Fine-Tuning Foundation Models
Authors:
Fangzhao Zhang,
Mert Pilanci
Abstract:
Low-Rank Adaptation (LoRA) emerges as a popular parameter-efficient fine-tuning (PEFT) method, which proposes to freeze pretrained model weights and update an additive low-rank trainable matrix. In this work, we study the enhancement of LoRA training by introducing an $r \times r$ preconditioner in each gradient step where $r$ is the LoRA rank. We theoretically verify that the proposed preconditio…
▽ More
Low-Rank Adaptation (LoRA) emerges as a popular parameter-efficient fine-tuning (PEFT) method, which proposes to freeze pretrained model weights and update an additive low-rank trainable matrix. In this work, we study the enhancement of LoRA training by introducing an $r \times r$ preconditioner in each gradient step where $r$ is the LoRA rank. We theoretically verify that the proposed preconditioner stabilizes feature learning with LoRA under infinite-width NN setting. Empirically, the implementation of this new preconditioner requires a small change to existing optimizer code and creates virtually minuscule storage and runtime overhead. Our experimental results with both large language models and text-to-image diffusion models show that with this new preconditioner, the convergence and reliability of SGD and AdamW can be significantly enhanced. Moreover, the training process becomes much more robust to hyperparameter choices such as learning rate. The new preconditioner can be derived from a novel Riemannian metric in low-rank matrix field. Code can be accessed at https://github.com/pilancilab/Riemannian_Preconditioned_LoRA.
△ Less
Submitted 5 June, 2024; v1 submitted 4 February, 2024;
originally announced February 2024.
-
Analyzing Neural Network-Based Generative Diffusion Models through Convex Optimization
Authors:
Fangzhao Zhang,
Mert Pilanci
Abstract:
Diffusion models are gaining widespread use in cutting-edge image, video, and audio generation. Score-based diffusion models stand out among these methods, necessitating the estimation of score function of the input data distribution. In this study, we present a theoretical framework to analyze two-layer neural network-based diffusion models by reframing score matching and denoising score matching…
▽ More
Diffusion models are gaining widespread use in cutting-edge image, video, and audio generation. Score-based diffusion models stand out among these methods, necessitating the estimation of score function of the input data distribution. In this study, we present a theoretical framework to analyze two-layer neural network-based diffusion models by reframing score matching and denoising score matching as convex optimization. We prove that training shallow neural networks for score prediction can be done by solving a single convex program. Although most analyses of diffusion models operate in the asymptotic setting or rely on approximations, we characterize the exact predicted score function and establish convergence results for neural network-based diffusion models with finite data. Our results provide a precise characterization of what neural network-based diffusion models learn in non-asymptotic settings.
△ Less
Submitted 22 May, 2024; v1 submitted 2 February, 2024;
originally announced February 2024.
-
Distributed Markov Chain Monte Carlo Sampling based on the Alternating Direction Method of Multipliers
Authors:
Alexandros E. Tzikas,
Licio Romao,
Mert Pilanci,
Alessandro Abate,
Mykel J. Kochenderfer
Abstract:
Many machine learning applications require operating on a spatially distributed dataset. Despite technological advances, privacy considerations and communication constraints may prevent gathering the entire dataset in a central unit. In this paper, we propose a distributed sampling scheme based on the alternating direction method of multipliers, which is commonly used in the optimization literatur…
▽ More
Many machine learning applications require operating on a spatially distributed dataset. Despite technological advances, privacy considerations and communication constraints may prevent gathering the entire dataset in a central unit. In this paper, we propose a distributed sampling scheme based on the alternating direction method of multipliers, which is commonly used in the optimization literature due to its fast convergence. In contrast to distributed optimization, distributed sampling allows for uncertainty quantification in Bayesian inference tasks. We provide both theoretical guarantees of our algorithm's convergence and experimental evidence of its superiority to the state-of-the-art. For our theoretical results, we use convex optimization tools to establish a fundamental inequality on the generated local sample iterates. This inequality enables us to show convergence of the distribution associated with these iterates to the underlying target distribution in Wasserstein distance. In simulation, we deploy our algorithm on linear and logistic regression tasks and illustrate its fast convergence compared to existing gradient-based methods.
△ Less
Submitted 28 January, 2024;
originally announced January 2024.
-
The Convex Landscape of Neural Networks: Characterizing Global Optima and Stationary Points via Lasso Models
Authors:
Tolga Ergen,
Mert Pilanci
Abstract:
Due to the non-convex nature of training Deep Neural Network (DNN) models, their effectiveness relies on the use of non-convex optimization heuristics. Traditional methods for training DNNs often require costly empirical methods to produce successful models and do not have a clear theoretical foundation. In this study, we examine the use of convex optimization theory and sparse recovery models to…
▽ More
Due to the non-convex nature of training Deep Neural Network (DNN) models, their effectiveness relies on the use of non-convex optimization heuristics. Traditional methods for training DNNs often require costly empirical methods to produce successful models and do not have a clear theoretical foundation. In this study, we examine the use of convex optimization theory and sparse recovery models to refine the training process of neural networks and provide a better interpretation of their optimal weights. We focus on training two-layer neural networks with piecewise linear activations and demonstrate that they can be formulated as a finite-dimensional convex program. These programs include a regularization term that promotes sparsity, which constitutes a variant of group Lasso. We first utilize semi-infinite programming theory to prove strong duality for finite width neural networks and then we express these architectures equivalently as high dimensional convex sparse recovery models. Remarkably, the worst-case complexity to solve the convex program is polynomial in the number of samples and number of neurons when the rank of the data matrix is bounded, which is the case in convolutional networks. To extend our method to training data of arbitrary rank, we develop a novel polynomial-time approximation scheme based on zonotope subsampling that comes with a guaranteed approximation ratio. We also show that all the stationary of the nonconvex training objective can be characterized as the global optimum of a subsampled convex program. Our convex models can be trained using standard convex solvers without resorting to heuristics or extensive hyper-parameter tuning unlike non-convex methods. Through extensive numerical experiments, we show that convex models can outperform traditional non-convex methods and are not sensitive to optimizer hyperparameters.
△ Less
Submitted 19 December, 2023;
originally announced December 2023.
-
Volumetric Reconstruction Resolves Off-Resonance Artifacts in Static and Dynamic PROPELLER MRI
Authors:
Annesha Ghosh,
Gordon Wetzstein,
Mert Pilanci,
Sara Fridovich-Keil
Abstract:
Off-resonance artifacts in magnetic resonance imaging (MRI) are visual distortions that occur when the actual resonant frequencies of spins within the imaging volume differ from the expected frequencies used to encode spatial information. These discrepancies can be caused by a variety of factors, including magnetic field inhomogeneities, chemical shifts, or susceptibility differences within the ti…
▽ More
Off-resonance artifacts in magnetic resonance imaging (MRI) are visual distortions that occur when the actual resonant frequencies of spins within the imaging volume differ from the expected frequencies used to encode spatial information. These discrepancies can be caused by a variety of factors, including magnetic field inhomogeneities, chemical shifts, or susceptibility differences within the tissues. Such artifacts can manifest as blurring, ghosting, or misregistration of the reconstructed image, and they often compromise its diagnostic quality. We propose to resolve these artifacts by lifting the 2D MRI reconstruction problem to 3D, introducing an additional "spectral" dimension to model this off-resonance. Our approach is inspired by recent progress in modeling radiance fields, and is capable of reconstructing both static and dynamic MR images as well as separating fat and water, which is of independent clinical interest. We demonstrate our approach in the context of PROPELLER (Periodically Rotated Overlap** ParallEL Lines with Enhanced Reconstruction) MRI acquisitions, which are popular for their robustness to motion artifacts. Our method operates in a few minutes on a single GPU, and to our knowledge is the first to correct for chemical shift in gradient echo PROPELLER MRI reconstruction without additional measurements or pretraining data.
△ Less
Submitted 22 November, 2023;
originally announced November 2023.
-
Polynomial-Time Solutions for ReLU Network Training: A Complexity Classification via Max-Cut and Zonotopes
Authors:
Yifei Wang,
Mert Pilanci
Abstract:
We investigate the complexity of training a two-layer ReLU neural network with weight decay regularization. Previous research has shown that the optimal solution of this problem can be found by solving a standard cone-constrained convex program. Using this convex formulation, we prove that the hardness of approximation of ReLU networks not only mirrors the complexity of the Max-Cut problem but als…
▽ More
We investigate the complexity of training a two-layer ReLU neural network with weight decay regularization. Previous research has shown that the optimal solution of this problem can be found by solving a standard cone-constrained convex program. Using this convex formulation, we prove that the hardness of approximation of ReLU networks not only mirrors the complexity of the Max-Cut problem but also, in certain special cases, exactly corresponds to it. In particular, when $ε\leq\sqrt{84/83}-1\approx 0.006$, we show that it is NP-hard to find an approximate global optimizer of the ReLU network objective with relative error $ε$ with respect to the objective value. Moreover, we develop a randomized algorithm which mirrors the Goemans-Williamson rounding of semidefinite Max-Cut relaxations. To provide polynomial-time approximations, we classify training datasets into three categories: (i) For orthogonal separable datasets, a precise solution can be obtained in polynomial-time. (ii) When there is a negative correlation between samples of different classes, we give a polynomial-time approximation with relative error $\sqrt{π/2}-1\approx 0.253$. (iii) For general datasets, the degree to which the problem can be approximated in polynomial-time is governed by a geometric factor that controls the diameter of two zonotopes intrinsic to the dataset. To our knowledge, these results present the first polynomial-time approximation guarantees along with first hardness of approximation results for regularized ReLU networks.
△ Less
Submitted 17 November, 2023;
originally announced November 2023.
-
Matrix Compression via Randomized Low Rank and Low Precision Factorization
Authors:
Rajarshi Saha,
Varun Srivastava,
Mert Pilanci
Abstract:
Matrices are exceptionally useful in various fields of study as they provide a convenient framework to organize and manipulate data in a structured manner. However, modern matrices can involve billions of elements, making their storage and processing quite demanding in terms of computational resources and memory usage. Although prohibitively large, such matrices are often approximately low rank. W…
▽ More
Matrices are exceptionally useful in various fields of study as they provide a convenient framework to organize and manipulate data in a structured manner. However, modern matrices can involve billions of elements, making their storage and processing quite demanding in terms of computational resources and memory usage. Although prohibitively large, such matrices are often approximately low rank. We propose an algorithm that exploits this structure to obtain a low rank decomposition of any matrix $\mathbf{A}$ as $\mathbf{A} \approx \mathbf{L}\mathbf{R}$, where $\mathbf{L}$ and $\mathbf{R}$ are the low rank factors. The total number of elements in $\mathbf{L}$ and $\mathbf{R}$ can be significantly less than that in $\mathbf{A}$. Furthermore, the entries of $\mathbf{L}$ and $\mathbf{R}$ are quantized to low precision formats $--$ compressing $\mathbf{A}$ by giving us a low rank and low precision factorization. Our algorithm first computes an approximate basis of the range space of $\mathbf{A}$ by randomly sketching its columns, followed by a quantization of the vectors constituting this basis. It then computes approximate projections of the columns of $\mathbf{A}$ onto this quantized basis. We derive upper bounds on the approximation error of our algorithm, and analyze the impact of target rank and quantization bit-budget. The tradeoff between compression ratio and approximation accuracy allows for flexibility in choosing these parameters based on specific application requirements. We empirically demonstrate the efficacy of our algorithm in image compression, nearest neighbor classification of image and text embeddings, and compressing the layers of LlaMa-$7$b. Our results illustrate that we can achieve compression ratios as aggressive as one bit per matrix coordinate, all while surpassing or maintaining the performance of traditional compression techniques.
△ Less
Submitted 17 October, 2023;
originally announced October 2023.
-
From Complexity to Clarity: Analytical Expressions of Deep Neural Network Weights via Clifford's Geometric Algebra and Convexity
Authors:
Mert Pilanci
Abstract:
In this paper, we introduce a novel analysis of neural networks based on geometric (Clifford) algebra and convex optimization. We show that optimal weights of deep ReLU neural networks are given by the wedge product of training samples when trained with standard regularized loss. Furthermore, the training problem reduces to convex optimization over wedge product features, which encode the geometri…
▽ More
In this paper, we introduce a novel analysis of neural networks based on geometric (Clifford) algebra and convex optimization. We show that optimal weights of deep ReLU neural networks are given by the wedge product of training samples when trained with standard regularized loss. Furthermore, the training problem reduces to convex optimization over wedge product features, which encode the geometric structure of the training dataset. This structure is given in terms of signed volumes of triangles and parallelotopes generated by data vectors. The convex problem finds a small subset of samples via $\ell_1$ regularization to discover only relevant wedge product features. Our analysis provides a novel perspective on the inner workings of deep neural networks and sheds light on the role of the hidden layers.
△ Less
Submitted 22 March, 2024; v1 submitted 28 September, 2023;
originally announced September 2023.
-
Fixing the NTK: From Neural Network Linearizations to Exact Convex Programs
Authors:
Rajat Vadiraj Dwaraknath,
Tolga Ergen,
Mert Pilanci
Abstract:
Recently, theoretical analyses of deep neural networks have broadly focused on two directions: 1) Providing insight into neural network training by SGD in the limit of infinite hidden-layer width and infinitesimally small learning rate (also known as gradient flow) via the Neural Tangent Kernel (NTK), and 2) Globally optimizing the regularized training objective via cone-constrained convex reformu…
▽ More
Recently, theoretical analyses of deep neural networks have broadly focused on two directions: 1) Providing insight into neural network training by SGD in the limit of infinite hidden-layer width and infinitesimally small learning rate (also known as gradient flow) via the Neural Tangent Kernel (NTK), and 2) Globally optimizing the regularized training objective via cone-constrained convex reformulations of ReLU networks. The latter research direction also yielded an alternative formulation of the ReLU network, called a gated ReLU network, that is globally optimizable via efficient unconstrained convex programs. In this work, we interpret the convex program for this gated ReLU network as a Multiple Kernel Learning (MKL) model with a weighted data masking feature map and establish a connection to the NTK. Specifically, we show that for a particular choice of mask weights that do not depend on the learning targets, this kernel is equivalent to the NTK of the gated ReLU network on the training data. A consequence of this lack of dependence on the targets is that the NTK cannot perform better than the optimal MKL kernel on the training set. By using iterative reweighting, we improve the weights induced by the NTK to obtain the optimal MKL kernel which is equivalent to the solution of the exact convex reformulation of the gated ReLU network. We also provide several numerical simulations corroborating our theory. Additionally, we provide an analysis of the prediction error of the resulting optimal kernel via consistency results for the group lasso.
△ Less
Submitted 26 September, 2023;
originally announced September 2023.
-
Randomized Polar Codes for Anytime Distributed Machine Learning
Authors:
Burak Bartan,
Mert Pilanci
Abstract:
We present a novel distributed computing framework that is robust to slow compute nodes, and is capable of both approximate and exact computation of linear operations. The proposed mechanism integrates the concepts of randomized sketching and polar codes in the context of coded computation. We propose a sequential decoding algorithm designed to handle real valued data while maintaining low computa…
▽ More
We present a novel distributed computing framework that is robust to slow compute nodes, and is capable of both approximate and exact computation of linear operations. The proposed mechanism integrates the concepts of randomized sketching and polar codes in the context of coded computation. We propose a sequential decoding algorithm designed to handle real valued data while maintaining low computational complexity for recovery. Additionally, we provide an anytime estimator that can generate provably accurate estimates even when the set of available node outputs is not decodable. We demonstrate the potential applications of this framework in various contexts, such as large-scale matrix multiplication and black-box optimization. We present the implementation of these methods on a serverless cloud computing system and provide numerical results to demonstrate their scalability in practice, including ImageNet scale computations.
△ Less
Submitted 1 September, 2023;
originally announced September 2023.
-
Iterative Sketching for Secure Coded Regression
Authors:
Neophytos Charalambides,
Hessam Mahdavifar,
Mert Pilanci,
Alfred O. Hero III
Abstract:
Linear regression is a fundamental and primitive problem in supervised machine learning, with applications ranging from epidemiology to finance. In this work, we propose methods for speeding up distributed linear regression. We do so by leveraging randomized techniques, while also ensuring security and straggler resiliency in asynchronous distributed computing systems. Specifically, we randomly ro…
▽ More
Linear regression is a fundamental and primitive problem in supervised machine learning, with applications ranging from epidemiology to finance. In this work, we propose methods for speeding up distributed linear regression. We do so by leveraging randomized techniques, while also ensuring security and straggler resiliency in asynchronous distributed computing systems. Specifically, we randomly rotate the basis of the system of equations and then subsample blocks, to simultaneously secure the information and reduce the dimension of the regression problem. In our setup, the basis rotation corresponds to an encoded encryption in an approximate gradient coding scheme, and the subsampling corresponds to the responses of the non-straggling servers in the centralized coded computing framework. This results in a distributive iterative stochastic approach for matrix compression and steepest descent.
△ Less
Submitted 31 March, 2024; v1 submitted 8 August, 2023;
originally announced August 2023.
-
Gradient Coding with Iterative Block Leverage Score Sampling
Authors:
Neophytos Charalambides,
Mert Pilanci,
Alfred Hero
Abstract:
We generalize the leverage score sampling sketch for $\ell_2$-subspace embeddings, to accommodate sampling subsets of the transformed data, so that the sketching approach is appropriate for distributed settings. This is then used to derive an approximate coded computing approach for first-order methods; known as gradient coding, to accelerate linear regression in the presence of failures in distri…
▽ More
We generalize the leverage score sampling sketch for $\ell_2$-subspace embeddings, to accommodate sampling subsets of the transformed data, so that the sketching approach is appropriate for distributed settings. This is then used to derive an approximate coded computing approach for first-order methods; known as gradient coding, to accelerate linear regression in the presence of failures in distributed computational networks, \textit{i.e.} stragglers. We replicate the data across the distributed network, to attain the approximation guarantees through the induced sampling distribution. The significance and main contribution of this work, is that it unifies randomized numerical linear algebra with approximate coded computing, while attaining an induced $\ell_2$-subspace embedding through uniform sampling. The transition to uniform sampling is done without applying a random projection, as in the case of the subsampled randomized Hadamard transform. Furthermore, by incorporating this technique to coded computing, our scheme is an iterative sketching approach to approximately solving linear regression. We also propose weighting when sketching takes place through sampling with replacement, for further compression.
△ Less
Submitted 25 June, 2024; v1 submitted 6 August, 2023;
originally announced August 2023.
-
Optimal Sets and Solution Paths of ReLU Networks
Authors:
Aaron Mishkin,
Mert Pilanci
Abstract:
We develop an analytical framework to characterize the set of optimal ReLU neural networks by reformulating the non-convex training problem as a convex program. We show that the global optima of the convex parameterization are given by a polyhedral set and then extend this characterization to the optimal set of the non-convex training objective. Since all stationary points of the ReLU training pro…
▽ More
We develop an analytical framework to characterize the set of optimal ReLU neural networks by reformulating the non-convex training problem as a convex program. We show that the global optima of the convex parameterization are given by a polyhedral set and then extend this characterization to the optimal set of the non-convex training objective. Since all stationary points of the ReLU training problem can be represented as optima of sub-sampled convex programs, our work provides a general expression for all critical points of the non-convex objective. We then leverage our results to provide an optimal pruning algorithm for computing minimal networks, establish conditions for the regularization path of ReLU networks to be continuous, and develop sensitivity results for minimal ReLU networks.
△ Less
Submitted 19 January, 2024; v1 submitted 31 May, 2023;
originally announced June 2023.
-
Globally Optimal Training of Neural Networks with Threshold Activation Functions
Authors:
Tolga Ergen,
Halil Ibrahim Gulluk,
Jonathan Lacotte,
Mert Pilanci
Abstract:
Threshold activation functions are highly preferable in neural networks due to their efficiency in hardware implementations. Moreover, their mode of operation is more interpretable and resembles that of biological neurons. However, traditional gradient based algorithms such as Gradient Descent cannot be used to train the parameters of neural networks with threshold activations since the activation…
▽ More
Threshold activation functions are highly preferable in neural networks due to their efficiency in hardware implementations. Moreover, their mode of operation is more interpretable and resembles that of biological neurons. However, traditional gradient based algorithms such as Gradient Descent cannot be used to train the parameters of neural networks with threshold activations since the activation function has zero gradient except at a single non-differentiable point. To this end, we study weight decay regularized training problems of deep neural networks with threshold activations. We first show that regularized deep threshold network training problems can be equivalently formulated as a standard convex optimization problem, which parallels the LASSO method, provided that the last hidden layer width exceeds a certain threshold. We also derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network. We corroborate our theoretical results with various numerical experiments.
△ Less
Submitted 6 March, 2023;
originally announced March 2023.
-
Complex Clip** for Improved Generalization in Machine Learning
Authors:
Les Atlas,
Nicholas Rasmussen,
Felix Schwock,
Mert Pilanci
Abstract:
For many machine learning applications, a common input representation is a spectrogram. The underlying representation for a spectrogram is a short time Fourier transform (STFT) which gives complex values. The spectrogram uses the magnitude of these complex values, a commonly used detector. Modern machine learning systems are commonly overparameterized, where possible ill-conditioning problems are…
▽ More
For many machine learning applications, a common input representation is a spectrogram. The underlying representation for a spectrogram is a short time Fourier transform (STFT) which gives complex values. The spectrogram uses the magnitude of these complex values, a commonly used detector. Modern machine learning systems are commonly overparameterized, where possible ill-conditioning problems are ameliorated by regularization. The common use of rectified linear unit (ReLU) activation functions between layers of a deep net has been shown to help this regularization, improving system performance. We extend this idea of ReLU activation to detection for the complex STFT, providing a simple-to-compute modified and regularized spectrogram, which potentially results in better behaved training. We then confirmed the benefit of this approach on a noisy acoustic data set used for a real-world application. Generalization performance improved substantially. This approach might benefit other applications which use time-frequency map**s, for acoustic, audio, and other applications.
△ Less
Submitted 27 February, 2023;
originally announced February 2023.
-
Securely Aggregated Coded Matrix Inversion
Authors:
Neophytos Charalambides,
Mert Pilanci,
Alfred Hero
Abstract:
Coded computing is a method for mitigating straggling workers in a centralized computing network, by using erasure-coding techniques. Federated learning is a decentralized model for training data distributed across client devices. In this work we propose approximating the inverse of an aggregated data matrix, where the data is generated by clients; similar to the federated learning paradigm, while…
▽ More
Coded computing is a method for mitigating straggling workers in a centralized computing network, by using erasure-coding techniques. Federated learning is a decentralized model for training data distributed across client devices. In this work we propose approximating the inverse of an aggregated data matrix, where the data is generated by clients; similar to the federated learning paradigm, while also being resilient to stragglers. To do so, we propose a coded computing method based on gradient coding. We modify this method so that the coordinator does not access the local data at any point; while the clients access the aggregated matrix in order to complete their tasks. The network we consider is not centrally administrated, and the communications which take place are secure against potential eavesdroppers.
△ Less
Submitted 4 September, 2023; v1 submitted 9 January, 2023;
originally announced January 2023.
-
Overparameterized ReLU Neural Networks Learn the Simplest Models: Neural Isometry and Exact Recovery
Authors:
Yifei Wang,
Yixuan Hua,
Emmanuel Candés,
Mert Pilanci
Abstract:
The practice of deep learning has shown that neural networks generalize remarkably well even with an extreme number of learned parameters. This appears to contradict traditional statistical wisdom, in which a trade-off between model complexity and fit to the data is essential. We aim to address this discrepancy by adopting a convex optimization and sparse recovery perspective. We consider the trai…
▽ More
The practice of deep learning has shown that neural networks generalize remarkably well even with an extreme number of learned parameters. This appears to contradict traditional statistical wisdom, in which a trade-off between model complexity and fit to the data is essential. We aim to address this discrepancy by adopting a convex optimization and sparse recovery perspective. We consider the training and generalization properties of two-layer ReLU networks with standard weight decay regularization. Under certain regularity assumptions on the data, we show that ReLU networks with an arbitrary number of parameters learn only simple models that explain the data. This is analogous to the recovery of the sparsest linear model in compressed sensing. For ReLU networks and their variants with skip connections or normalization layers, we present isometry conditions that ensure the exact recovery of planted neurons. For randomly generated data, we show the existence of a phase transition in recovering planted neural network models, which is easy to describe: whenever the ratio between the number of samples and the dimension exceeds a numerical threshold, the recovery succeeds with high probability; otherwise, it fails with high probability. Surprisingly, ReLU networks learn simple and sparse models that generalize well even when the labels are noisy . The phase transition phenomenon is confirmed through numerical experiments.
△ Less
Submitted 17 February, 2023; v1 submitted 30 September, 2022;
originally announced September 2022.
-
GLEAM: Greedy Learning for Large-Scale Accelerated MRI Reconstruction
Authors:
Batu Ozturkler,
Arda Sahiner,
Tolga Ergen,
Arjun D Desai,
Christopher M Sandino,
Shreyas Vasanawala,
John M Pauly,
Morteza Mardani,
Mert Pilanci
Abstract:
Unrolled neural networks have recently achieved state-of-the-art accelerated MRI reconstruction. These networks unroll iterative optimization algorithms by alternating between physics-based consistency and neural-network based regularization. However, they require several iterations of a large neural network to handle high-dimensional imaging tasks such as 3D MRI. This limits traditional training…
▽ More
Unrolled neural networks have recently achieved state-of-the-art accelerated MRI reconstruction. These networks unroll iterative optimization algorithms by alternating between physics-based consistency and neural-network based regularization. However, they require several iterations of a large neural network to handle high-dimensional imaging tasks such as 3D MRI. This limits traditional training algorithms based on backpropagation due to prohibitively large memory and compute requirements for calculating gradients and storing intermediate activations. To address this challenge, we propose Greedy LEarning for Accelerated MRI (GLEAM) reconstruction, an efficient training strategy for high-dimensional imaging settings. GLEAM splits the end-to-end network into decoupled network modules. Each module is optimized in a greedy manner with decoupled gradient updates, reducing the memory footprint during training. We show that the decoupled gradient updates can be performed in parallel on multiple graphical processing units (GPUs) to further reduce training time. We present experiments with 2D and 3D datasets including multi-coil knee, brain, and dynamic cardiac cine MRI. We observe that: i) GLEAM generalizes as well as state-of-the-art memory-efficient baselines such as gradient checkpointing and invertible networks with the same memory footprint, but with 1.3x faster training; ii) for the same memory footprint, GLEAM yields 1.1dB PSNR gain in 2D and 1.8 dB in 3D over end-to-end baselines.
△ Less
Submitted 18 July, 2022;
originally announced July 2022.
-
Secure Linear MDS Coded Matrix Inversion
Authors:
Neophytos Charalambides,
Mert Pilanci,
Alfred Hero
Abstract:
A cumbersome operation in many scientific fields, is inverting large full-rank matrices. In this paper, we propose a coded computing approach for recovering matrix inverse approximations. We first present an approximate matrix inversion algorithm which does not require a matrix factorization, but uses a black-box least squares optimization solver as a subroutine, to give an estimate of the inverse…
▽ More
A cumbersome operation in many scientific fields, is inverting large full-rank matrices. In this paper, we propose a coded computing approach for recovering matrix inverse approximations. We first present an approximate matrix inversion algorithm which does not require a matrix factorization, but uses a black-box least squares optimization solver as a subroutine, to give an estimate of the inverse of a real full-rank matrix. We then present a distributed framework for which our algorithm can be implemented, and show how we can leverage sparsest-balanced MDS generator matrices to devise matrix inversion coded computing schemes. We focus on balanced Reed-Solomon codes, which are optimal in terms of computational load; and communication from the workers to the master server. We also discuss how our algorithms can be used to compute the pseudoinverse of a full-rank matrix, and how the communication is secured from eavesdroppers.
△ Less
Submitted 20 December, 2022; v1 submitted 13 July, 2022;
originally announced July 2022.
-
Optimal Neural Network Approximation of Wasserstein Gradient Direction via Convex Optimization
Authors:
Yifei Wang,
Peng Chen,
Mert Pilanci,
Wuchen Li
Abstract:
The computation of Wasserstein gradient direction is essential for posterior sampling problems and scientific computing. The approximation of the Wasserstein gradient with finite samples requires solving a variational problem. We study the variational problem in the family of two-layer networks with squared-ReLU activations, towards which we derive a semi-definite programming (SDP) relaxation. Thi…
▽ More
The computation of Wasserstein gradient direction is essential for posterior sampling problems and scientific computing. The approximation of the Wasserstein gradient with finite samples requires solving a variational problem. We study the variational problem in the family of two-layer networks with squared-ReLU activations, towards which we derive a semi-definite programming (SDP) relaxation. This SDP can be viewed as an approximation of the Wasserstein gradient in a broader function family including two-layer networks. By solving the convex SDP, we obtain the optimal approximation of the Wasserstein gradient direction in this class of functions. Numerical experiments including PDE-constrained Bayesian inference and parameter estimation in COVID-19 modeling demonstrate the effectiveness of the proposed method.
△ Less
Submitted 25 May, 2022;
originally announced May 2022.
-
Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers
Authors:
Arda Sahiner,
Tolga Ergen,
Batu Ozturkler,
John Pauly,
Morteza Mardani,
Mert Pilanci
Abstract:
Vision transformers using self-attention or its proposed alternatives have demonstrated promising results in many image related tasks. However, the underpinning inductive bias of attention is not well understood. To address this issue, this paper analyzes attention through the lens of convex duality. For the non-linear dot-product self-attention, and alternative mechanisms such as MLP-mixer and Fo…
▽ More
Vision transformers using self-attention or its proposed alternatives have demonstrated promising results in many image related tasks. However, the underpinning inductive bias of attention is not well understood. To address this issue, this paper analyzes attention through the lens of convex duality. For the non-linear dot-product self-attention, and alternative mechanisms such as MLP-mixer and Fourier Neural Operator (FNO), we derive equivalent finite-dimensional convex problems that are interpretable and solvable to global optimality. The convex programs lead to {\it block nuclear-norm regularization} that promotes low rank in the latent feature and token dimensions. In particular, we show how self-attention networks implicitly clusters the tokens, based on their latent similarity. We conduct experiments for transferring a pre-trained transformer backbone for CIFAR-100 classification by fine-tuning a variety of convex attention heads. The results indicate the merits of the bias induced by attention compared with the existing MLP or linear heads.
△ Less
Submitted 20 May, 2022; v1 submitted 17 May, 2022;
originally announced May 2022.
-
Scale-Equivariant Unrolled Neural Networks for Data-Efficient Accelerated MRI Reconstruction
Authors:
Beliz Gunel,
Arda Sahiner,
Arjun D. Desai,
Akshay S. Chaudhari,
Shreyas Vasanawala,
Mert Pilanci,
John Pauly
Abstract:
Unrolled neural networks have enabled state-of-the-art reconstruction performance and fast inference times for the accelerated magnetic resonance imaging (MRI) reconstruction task. However, these approaches depend on fully-sampled scans as ground truth data which is either costly or not possible to acquire in many clinical medical imaging applications; hence, reducing dependence on data is desirab…
▽ More
Unrolled neural networks have enabled state-of-the-art reconstruction performance and fast inference times for the accelerated magnetic resonance imaging (MRI) reconstruction task. However, these approaches depend on fully-sampled scans as ground truth data which is either costly or not possible to acquire in many clinical medical imaging applications; hence, reducing dependence on data is desirable. In this work, we propose modeling the proximal operators of unrolled neural networks with scale-equivariant convolutional neural networks in order to improve the data-efficiency and robustness to drifts in scale of the images that might stem from the variability of patient anatomies or change in field-of-view across different MRI scanners. Our approach demonstrates strong improvements over the state-of-the-art unrolled neural networks under the same memory constraints both with and without data augmentations on both in-distribution and out-of-distribution scaled images without significantly increasing the train or inference time.
△ Less
Submitted 21 April, 2022;
originally announced April 2022.
-
Approximate Function Evaluation via Multi-Armed Bandits
Authors:
Tavor Z. Baharav,
Gary Cheng,
Mert Pilanci,
David Tse
Abstract:
We study the problem of estimating the value of a known smooth function $f$ at an unknown point $\boldsymbolμ \in \mathbb{R}^n$, where each component $μ_i$ can be sampled via a noisy oracle. Sampling more frequently components of $\boldsymbolμ$ corresponding to directions of the function with larger directional derivatives is more sample-efficient. However, as $\boldsymbolμ$ is unknown, the optima…
▽ More
We study the problem of estimating the value of a known smooth function $f$ at an unknown point $\boldsymbolμ \in \mathbb{R}^n$, where each component $μ_i$ can be sampled via a noisy oracle. Sampling more frequently components of $\boldsymbolμ$ corresponding to directions of the function with larger directional derivatives is more sample-efficient. However, as $\boldsymbolμ$ is unknown, the optimal sampling frequencies are also unknown. We design an instance-adaptive algorithm that learns to sample according to the importance of each coordinate, and with probability at least $1-δ$ returns an $ε$ accurate estimate of $f(\boldsymbolμ)$. We generalize our algorithm to adapt to heteroskedastic noise, and prove asymptotic optimality when $f$ is linear. We corroborate our theoretical results with numerical experiments, showing the dramatic gains afforded by adaptivity.
△ Less
Submitted 18 March, 2022;
originally announced March 2022.
-
Distributed Sketching for Randomized Optimization: Exact Characterization, Concentration and Lower Bounds
Authors:
Burak Bartan,
Mert Pilanci
Abstract:
We consider distributed optimization methods for problems where forming the Hessian is computationally challenging and communication is a significant bottleneck. We leverage randomized sketches for reducing the problem dimensions as well as preserving privacy and improving straggler resilience in asynchronous distributed systems. We derive novel approximation guarantees for classical sketching met…
▽ More
We consider distributed optimization methods for problems where forming the Hessian is computationally challenging and communication is a significant bottleneck. We leverage randomized sketches for reducing the problem dimensions as well as preserving privacy and improving straggler resilience in asynchronous distributed systems. We derive novel approximation guarantees for classical sketching methods and establish tight concentration results that serve as both upper and lower bounds on the error. We then extend our analysis to the accuracy of parameter averaging for distributed sketches. Furthermore, we develop unbiased parameter averaging methods for randomized second order optimization for regularized problems that employ sketching of the Hessian. Existing works do not take the bias of the estimators into consideration, which limits their application to massively parallel computation. We provide closed-form formulas for regularization parameters and step sizes that provably minimize the bias for sketched Newton directions. Additionally, we demonstrate the implications of our theoretical findings via large scale experiments on a serverless cloud computing platform.
△ Less
Submitted 18 March, 2022;
originally announced March 2022.
-
Minimax Optimal Quantization of Linear Models: Information-Theoretic Limits and Efficient Algorithms
Authors:
Rajarshi Saha,
Mert Pilanci,
Andrea J. Goldsmith
Abstract:
High-dimensional models often have a large memory footprint and must be quantized after training before being deployed on resource-constrained edge devices for inference tasks. In this work, we develop an information-theoretic framework for the problem of quantizing a linear regressor learned from training data $(\mathbf{X}, \mathbf{y})$, for some underlying statistical relationship…
▽ More
High-dimensional models often have a large memory footprint and must be quantized after training before being deployed on resource-constrained edge devices for inference tasks. In this work, we develop an information-theoretic framework for the problem of quantizing a linear regressor learned from training data $(\mathbf{X}, \mathbf{y})$, for some underlying statistical relationship $\mathbf{y} = \mathbf{X}\boldsymbolθ + \mathbf{v}$. The learned model, which is an estimate of the latent parameter $\boldsymbolθ \in \mathbb{R}^d$, is constrained to be representable using only $Bd$ bits, where $B \in (0, \infty)$ is a pre-specified budget and $d$ is the dimension. We derive an information-theoretic lower bound for the minimax risk under this setting and propose a matching upper bound using randomized embedding-based algorithms which is tight up to constant factors. The lower and upper bounds together characterize the minimum threshold bit-budget required to achieve a performance risk comparable to the unquantized setting. We also propose randomized Hadamard embeddings that are computationally efficient and are optimal up to a mild logarithmic factor of the lower bound. Our model quantization strategy can be generalized and we show its efficacy by extending the method and upper-bounds to two-layer ReLU neural networks for non-linear regression. Numerical simulations show the improved performance of our proposed scheme as well as its closeness to the lower bound.
△ Less
Submitted 30 August, 2022; v1 submitted 22 February, 2022;
originally announced February 2022.
-
Fast Convex Optimization for Two-Layer ReLU Networks: Equivalent Model Classes and Cone Decompositions
Authors:
Aaron Mishkin,
Arda Sahiner,
Mert Pilanci
Abstract:
We develop fast algorithms and robust software for convex optimization of two-layer neural networks with ReLU activation functions. Our work leverages a convex reformulation of the standard weight-decay penalized training problem as a set of group-$\ell_1$-regularized data-local models, where locality is enforced by polyhedral cone constraints. In the special case of zero-regularization, we show t…
▽ More
We develop fast algorithms and robust software for convex optimization of two-layer neural networks with ReLU activation functions. Our work leverages a convex reformulation of the standard weight-decay penalized training problem as a set of group-$\ell_1$-regularized data-local models, where locality is enforced by polyhedral cone constraints. In the special case of zero-regularization, we show that this problem is exactly equivalent to unconstrained optimization of a convex "gated ReLU" network. For problems with non-zero regularization, we show that convex gated ReLU models obtain data-dependent approximation bounds for the ReLU training problem. To optimize the convex reformulations, we develop an accelerated proximal gradient method and a practical augmented Lagrangian solver. We show that these approaches are faster than standard training heuristics for the non-convex problem, such as SGD, and outperform commercial interior-point solvers. Experimentally, we verify our theoretical results, explore the group-$\ell_1$ regularization path, and scale convex optimization for neural networks to image classification on MNIST and CIFAR-10.
△ Less
Submitted 31 August, 2022; v1 submitted 2 February, 2022;
originally announced February 2022.
-
Using a Novel COVID-19 Calculator to Measure Positive U.S. Socio-Economic Impact of a COVID-19 Pre-Screening Solution (AI/ML)
Authors:
Richard Swartzbaugh,
Amil Khanzada,
Praveen Govindan,
Mert Pilanci,
Ayomide Owoyemi,
Les Atlas,
Hugo Estrada,
Richard Nall,
Michael Lotito,
Rich Falcone,
Jennifer Ranjani J
Abstract:
The COVID-19 pandemic has been a scourge upon humanity, claiming the lives of more than 5.1 million people worldwide; the global economy contracted by 3.5% in 2020. This paper presents a COVID-19 calculator, synthesizing existing published calculators and data points, to measure the positive U.S. socio-economic impact of a COVID-19 AI/ML pre-screening solution (algorithm & application).
The COVID-19 pandemic has been a scourge upon humanity, claiming the lives of more than 5.1 million people worldwide; the global economy contracted by 3.5% in 2020. This paper presents a COVID-19 calculator, synthesizing existing published calculators and data points, to measure the positive U.S. socio-economic impact of a COVID-19 AI/ML pre-screening solution (algorithm & application).
△ Less
Submitted 4 April, 2022; v1 submitted 20 January, 2022;
originally announced January 2022.
-
Orthonormal Sketches for Secure Coded Regression
Authors:
Neophytos Charalambides,
Hessam Mahdavifar,
Mert Pilanci,
Alfred O. Hero III
Abstract:
In this work, we propose a method for speeding up linear regression distributively, while ensuring security. We leverage randomized sketching techniques, and improve straggler resilience in asynchronous systems. Specifically, we apply a random orthonormal matrix and then subsample in \textit{blocks}, to simultaneously secure the information and reduce the dimension of the regression problem. In ou…
▽ More
In this work, we propose a method for speeding up linear regression distributively, while ensuring security. We leverage randomized sketching techniques, and improve straggler resilience in asynchronous systems. Specifically, we apply a random orthonormal matrix and then subsample in \textit{blocks}, to simultaneously secure the information and reduce the dimension of the regression problem. In our setup, the transformation corresponds to an encoded encryption in an \textit{approximate} gradient coding scheme, and the subsampling corresponds to the responses of the non-straggling workers; in a centralized coded computing network. We focus on the special case of the \textit{Subsampled Randomized Hadamard Transform}, which we generalize to block sampling; and discuss how it can be used to secure the data. We illustrate the performance through numerical experiments.
△ Less
Submitted 22 February, 2022; v1 submitted 20 January, 2022;
originally announced January 2022.
-
Using Deep Learning with Large Aggregated Datasets for COVID-19 Classification from Cough
Authors:
Esin Darici Haritaoglu,
Nicholas Rasmussen,
Daniel C. H. Tan,
Jennifer Ranjani J.,
Jaclyn Xiao,
Gunvant Chaudhari,
Akanksha Rajput,
Praveen Govindan,
Christian Canham,
Wei Chen,
Minami Yamaura,
Laura Gomezjurado,
Aaron Broukhim,
Amil Khanzada,
Mert Pilanci
Abstract:
The Covid-19 pandemic has been one of the most devastating events in recent history, claiming the lives of more than 5 million people worldwide. Even with the worldwide distribution of vaccines, there is an apparent need for affordable, reliable, and accessible screening techniques to serve parts of the World that do not have access to Western medicine. Artificial Intelligence can provide a soluti…
▽ More
The Covid-19 pandemic has been one of the most devastating events in recent history, claiming the lives of more than 5 million people worldwide. Even with the worldwide distribution of vaccines, there is an apparent need for affordable, reliable, and accessible screening techniques to serve parts of the World that do not have access to Western medicine. Artificial Intelligence can provide a solution utilizing cough sounds as a primary screening mode for COVID-19 diagnosis. This paper presents multiple models that have achieved relatively respectable performance on the largest evaluation dataset currently presented in academic literature. Through investigation of a self-supervised learning model (Area under the ROC curve, AUC = 0.807) and a convolutional nerual network (CNN) model (AUC = 0.802), we observe the possibility of model bias with limited datasets. Moreover, we observe that performance increases with training data size, showing the need for the worldwide collection of data to help combat the Covid-19 pandemic with non-traditional means.
△ Less
Submitted 29 March, 2022; v1 submitted 5 January, 2022;
originally announced January 2022.
-
Path Regularization: A Convexity and Sparsity Inducing Regularization for Parallel ReLU Networks
Authors:
Tolga Ergen,
Mert Pilanci
Abstract:
Understanding the fundamental principles behind the success of deep neural networks is one of the most important open questions in the current literature. To this end, we study the training problem of deep neural networks and introduce an analytic approach to unveil hidden convexity in the optimization landscape. We consider a deep parallel ReLU network architecture, which also includes standard d…
▽ More
Understanding the fundamental principles behind the success of deep neural networks is one of the most important open questions in the current literature. To this end, we study the training problem of deep neural networks and introduce an analytic approach to unveil hidden convexity in the optimization landscape. We consider a deep parallel ReLU network architecture, which also includes standard deep networks and ResNets as its special cases. We then show that pathwise regularized training problems can be represented as an exact convex optimization problem. We further prove that the equivalent convex problem is regularized via a group sparsity inducing norm. Thus, a path regularized parallel ReLU network can be viewed as a parsimonious convex model in high dimensions. More importantly, since the original training problem may not be trainable in polynomial-time, we propose an approximate algorithm with a fully polynomial-time complexity in all data dimensions. Then, we prove strong global optimality guarantees for this algorithm. We also provide experiments corroborating our theory.
△ Less
Submitted 25 September, 2023; v1 submitted 18 October, 2021;
originally announced October 2021.
-
The Convex Geometry of Backpropagation: Neural Network Gradient Flows Converge to Extreme Points of the Dual Convex Program
Authors:
Yifei Wang,
Mert Pilanci
Abstract:
We study non-convex subgradient flows for training two-layer ReLU neural networks from a convex geometry and duality perspective. We characterize the implicit bias of unregularized non-convex gradient flow as convex regularization of an equivalent convex model. We then show that the limit points of non-convex subgradient flows can be identified via primal-dual correspondence in this convex optimiz…
▽ More
We study non-convex subgradient flows for training two-layer ReLU neural networks from a convex geometry and duality perspective. We characterize the implicit bias of unregularized non-convex gradient flow as convex regularization of an equivalent convex model. We then show that the limit points of non-convex subgradient flows can be identified via primal-dual correspondence in this convex optimization problem. Moreover, we derive a sufficient condition on the dual variables which ensures that the stationary points of the non-convex objective are the KKT points of the convex objective, thus proving convergence of non-convex gradient flows to the global optimum. For a class of regular training data distributions such as orthogonal separable data, we show that this sufficient condition holds. Therefore, non-convex gradient flows in fact converge to optimal solutions of a convex optimization problem. We present numerical results verifying the predictions of our theory for non-convex subgradient descent.
△ Less
Submitted 13 October, 2021;
originally announced October 2021.
-
Parallel Deep Neural Networks Have Zero Duality Gap
Authors:
Yifei Wang,
Tolga Ergen,
Mert Pilanci
Abstract:
Training deep neural networks is a challenging non-convex optimization problem. Recent work has proven that the strong duality holds (which means zero duality gap) for regularized finite-width two-layer ReLU networks and consequently provided an equivalent convex training problem. However, extending this result to deeper networks remains to be an open problem. In this paper, we prove that the dual…
▽ More
Training deep neural networks is a challenging non-convex optimization problem. Recent work has proven that the strong duality holds (which means zero duality gap) for regularized finite-width two-layer ReLU networks and consequently provided an equivalent convex training problem. However, extending this result to deeper networks remains to be an open problem. In this paper, we prove that the duality gap for deeper linear networks with vector outputs is non-zero. In contrast, we show that the zero duality gap can be obtained by stacking standard deep networks in parallel, which we call a parallel architecture, and modifying the regularization. Therefore, we prove the strong duality and existence of equivalent convex problems that enable globally optimal training of deep networks. As a by-product of our analysis, we demonstrate that the weight decay regularization on the network parameters explicitly encourages low-rank solutions via closed-form expressions. In addition, we show that strong duality holds for three-layer standard ReLU networks given rank-1 data matrices.
△ Less
Submitted 6 March, 2023; v1 submitted 13 October, 2021;
originally announced October 2021.
-
Global Optimality Beyond Two Layers: Training Deep ReLU Networks via Convex Programs
Authors:
Tolga Ergen,
Mert Pilanci
Abstract:
Understanding the fundamental mechanism behind the success of deep neural networks is one of the key challenges in the modern machine learning literature. Despite numerous attempts, a solid theoretical analysis is yet to be developed. In this paper, we develop a novel unified framework to reveal a hidden regularization mechanism through the lens of convex optimization. We first show that the train…
▽ More
Understanding the fundamental mechanism behind the success of deep neural networks is one of the key challenges in the modern machine learning literature. Despite numerous attempts, a solid theoretical analysis is yet to be developed. In this paper, we develop a novel unified framework to reveal a hidden regularization mechanism through the lens of convex optimization. We first show that the training of multiple three-layer ReLU sub-networks with weight decay regularization can be equivalently cast as a convex optimization problem in a higher dimensional space, where sparsity is enforced via a group $\ell_1$-norm regularization. Consequently, ReLU networks can be interpreted as high dimensional feature selection methods. More importantly, we then prove that the equivalent convex problem can be globally optimized by a standard convex optimization solver with a polynomial-time complexity with respect to the number of samples and data dimension when the width of the network is fixed. Finally, we numerically validate our theoretical results via experiments involving both synthetic and real datasets.
△ Less
Submitted 12 January, 2022; v1 submitted 11 October, 2021;
originally announced October 2021.
-
Computational Polarization: An Information-theoretic Method for Resilient Computing
Authors:
Mert Pilanci
Abstract:
We introduce an error resilient distributed computing method based on an extension of the channel polarization phenomenon to distributed algorithms. The method leverages an algorithmic split operation that transforms two identical compute nodes to slow and fast workers, which parallels the channel split operation in Polar Codes. This operation preserves the average runtime, analogous to the conser…
▽ More
We introduce an error resilient distributed computing method based on an extension of the channel polarization phenomenon to distributed algorithms. The method leverages an algorithmic split operation that transforms two identical compute nodes to slow and fast workers, which parallels the channel split operation in Polar Codes. This operation preserves the average runtime, analogous to the conservation of Shannon capacity in channel polarization. By leveraging a recursive construction in a similar spirit to the Fast Fourier Transform, this method synthesizes virtual compute nodes with dispersed return time distributions, which we call computational polarization. We show that the runtime distributions form a functional martingale processes, identify their limiting distributions in closed-form expressions together with non-asymptotic convergence rates, and prove strong convergence results in Banach spaces. We provide an information-theoretic lower bound on the overall runtime of any coded computation method and show that the computational polarization approach asymptotically achieves the optimal runtime for computing linear functions. An important advantage is the near linear time decoding procedure, which is significantly cheaper than Maximum Distance Separable codes.
△ Less
Submitted 8 September, 2021;
originally announced September 2021.
-
Newton-LESS: Sparsification without Trade-offs for the Sketched Newton Update
Authors:
Michał Dereziński,
Jonathan Lacotte,
Mert Pilanci,
Michael W. Mahoney
Abstract:
In second-order optimization, a potential bottleneck can be computing the Hessian matrix of the optimized function at every iteration. Randomized sketching has emerged as a powerful technique for constructing estimates of the Hessian which can be used to perform approximate Newton steps. This involves multiplication by a random sketching matrix, which introduces a trade-off between the computation…
▽ More
In second-order optimization, a potential bottleneck can be computing the Hessian matrix of the optimized function at every iteration. Randomized sketching has emerged as a powerful technique for constructing estimates of the Hessian which can be used to perform approximate Newton steps. This involves multiplication by a random sketching matrix, which introduces a trade-off between the computational cost of sketching and the convergence rate of the optimization algorithm. A theoretically desirable but practically much too expensive choice is to use a dense Gaussian sketching matrix, which produces unbiased estimates of the exact Newton step and which offers strong problem-independent convergence guarantees. We show that the Gaussian sketching matrix can be drastically sparsified, significantly reducing the computational cost of sketching, without substantially affecting its convergence properties. This approach, called Newton-LESS, is based on a recently introduced sketching technique: LEverage Score Sparsified (LESS) embeddings. We prove that Newton-LESS enjoys nearly the same problem-independent local convergence rate as Gaussian embeddings, not just up to constant factors but even down to lower order terms, for a large class of optimization tasks. In particular, this leads to a new state-of-the-art convergence result for an iterative least squares solver. Finally, we extend LESS embeddings to include uniformly sparsified random sign matrices which can be implemented efficiently and which perform well in numerical experiments.
△ Less
Submitted 15 July, 2021;
originally announced July 2021.
-
Hidden Convexity of Wasserstein GANs: Interpretable Generative Models with Closed-Form Solutions
Authors:
Arda Sahiner,
Tolga Ergen,
Batu Ozturkler,
Burak Bartan,
John Pauly,
Morteza Mardani,
Mert Pilanci
Abstract:
Generative Adversarial Networks (GANs) are commonly used for modeling complex distributions of data. Both the generators and discriminators of GANs are often modeled by neural networks, posing a non-transparent optimization problem which is non-convex and non-concave over the generator and discriminator, respectively. Such networks are often heuristically optimized with gradient descent-ascent (GD…
▽ More
Generative Adversarial Networks (GANs) are commonly used for modeling complex distributions of data. Both the generators and discriminators of GANs are often modeled by neural networks, posing a non-transparent optimization problem which is non-convex and non-concave over the generator and discriminator, respectively. Such networks are often heuristically optimized with gradient descent-ascent (GDA), but it is unclear whether the optimization problem contains any saddle points, or whether heuristic methods can find them in practice. In this work, we analyze the training of Wasserstein GANs with two-layer neural network discriminators through the lens of convex duality, and for a variety of generators expose the conditions under which Wasserstein GANs can be solved exactly with convex optimization approaches, or can be represented as convex-concave games. Using this convex duality interpretation, we further demonstrate the impact of different activation functions of the discriminator. Our observations are verified with numerical results demonstrating the power of the convex interpretation, with applications in progressive training of convex architectures corresponding to linear generators and quadratic-activation discriminators for CelebA image generation. The code for our experiments is available at https://github.com/ardasahiner/ProCoGAN.
△ Less
Submitted 21 March, 2022; v1 submitted 12 July, 2021;
originally announced July 2021.
-
Adaptive Newton Sketch: Linear-time Optimization with Quadratic Convergence and Effective Hessian Dimensionality
Authors:
Jonathan Lacotte,
Yifei Wang,
Mert Pilanci
Abstract:
We propose a randomized algorithm with quadratic convergence rate for convex optimization problems with a self-concordant, composite, strongly convex objective function. Our method is based on performing an approximate Newton step using a random projection of the Hessian. Our first contribution is to show that, at each iteration, the embedding dimension (or sketch size) can be as small as the effe…
▽ More
We propose a randomized algorithm with quadratic convergence rate for convex optimization problems with a self-concordant, composite, strongly convex objective function. Our method is based on performing an approximate Newton step using a random projection of the Hessian. Our first contribution is to show that, at each iteration, the embedding dimension (or sketch size) can be as small as the effective dimension of the Hessian matrix. Leveraging this novel fundamental result, we design an algorithm with a sketch size proportional to the effective dimension and which exhibits a quadratic rate of convergence. This result dramatically improves on the classical linear-quadratic convergence rates of state-of-the-art sub-sampled Newton methods. However, in most practical cases, the effective dimension is not known beforehand, and this raises the question of how to pick a sketch size as small as the effective dimension while preserving a quadratic convergence rate. Our second and main contribution is thus to propose an adaptive sketch size algorithm with quadratic convergence rate and which does not require prior knowledge or estimation of the effective dimension: at each iteration, it starts with a small sketch size, and increases it until quadratic progress is achieved. Importantly, we show that the embedding dimension remains proportional to the effective dimension throughout the entire path and that our method achieves state-of-the-art computational complexity for solving convex optimization programs with a strongly convex component.
△ Less
Submitted 15 May, 2021;
originally announced May 2021.
-
Training Quantized Neural Networks to Global Optimality via Semidefinite Programming
Authors:
Burak Bartan,
Mert Pilanci
Abstract:
Neural networks (NNs) have been extremely successful across many tasks in machine learning. Quantization of NN weights has become an important topic due to its impact on their energy efficiency, inference time and deployment on hardware. Although post-training quantization is well-studied, training optimal quantized NNs involves combinatorial non-convex optimization problems which appear intractab…
▽ More
Neural networks (NNs) have been extremely successful across many tasks in machine learning. Quantization of NN weights has become an important topic due to its impact on their energy efficiency, inference time and deployment on hardware. Although post-training quantization is well-studied, training optimal quantized NNs involves combinatorial non-convex optimization problems which appear intractable. In this work, we introduce a convex optimization strategy to train quantized NNs with polynomial activations. Our method leverages hidden convexity in two-layer neural networks from the recent literature, semidefinite lifting, and Grothendieck's identity. Surprisingly, we show that certain quantized NN problems can be solved to global optimality in polynomial-time in all relevant parameters via semidefinite relaxations. We present numerical examples to illustrate the effectiveness of our method.
△ Less
Submitted 5 May, 2021; v1 submitted 4 May, 2021;
originally announced May 2021.
-
Fast Convex Quadratic Optimization Solvers with Adaptive Sketching-based Preconditioners
Authors:
Jonathan Lacotte,
Mert Pilanci
Abstract:
We consider least-squares problems with quadratic regularization and propose novel sketching-based iterative methods with an adaptive sketch size. The sketch size can be as small as the effective dimension of the data matrix to guarantee linear convergence. However, a major difficulty in choosing the sketch size in terms of the effective dimension lies in the fact that the latter is usually unknow…
▽ More
We consider least-squares problems with quadratic regularization and propose novel sketching-based iterative methods with an adaptive sketch size. The sketch size can be as small as the effective dimension of the data matrix to guarantee linear convergence. However, a major difficulty in choosing the sketch size in terms of the effective dimension lies in the fact that the latter is usually unknown in practice. Current sketching-based solvers for regularized least-squares fall short on addressing this issue. Our main contribution is to propose adaptive versions of standard sketching-based iterative solvers, namely, the iterative Hessian sketch and the preconditioned conjugate gradient method, that do not require a priori estimation of the effective dimension. We propose an adaptive mechanism to control the sketch size according to the progress made in each step of the iterative solver. If enough progress is not made, the sketch size increases to improve the convergence rate. We prove that the adaptive sketch size scales at most in terms of the effective dimension, and that our adaptive methods are guaranteed to converge linearly. Consequently, our adaptive methods improve the state-of-the-art complexity for solving dense, ill-conditioned least-squares problems. Importantly, we illustrate numerically on several synthetic and real datasets that our method is extremely efficient and is often significantly faster than standard least-squares solvers such as a direct factorization based solver, the conjugate gradient method and its preconditioned variants.
△ Less
Submitted 29 April, 2021;
originally announced April 2021.