Search | arXiv e-print repository

Hidden Convexity of Wasserstein GANs: Interpretable Generative Models with Closed-Form Solutions

Authors: Arda Sahiner, Tolga Ergen, Batu Ozturkler, Burak Bartan, John Pauly, Morteza Mardani, Mert Pilanci

Abstract: Generative Adversarial Networks (GANs) are commonly used for modeling complex distributions of data. Both the generators and discriminators of GANs are often modeled by neural networks, posing a non-transparent optimization problem which is non-convex and non-concave over the generator and discriminator, respectively. Such networks are often heuristically optimized with gradient descent-ascent (GD… ▽ More Generative Adversarial Networks (GANs) are commonly used for modeling complex distributions of data. Both the generators and discriminators of GANs are often modeled by neural networks, posing a non-transparent optimization problem which is non-convex and non-concave over the generator and discriminator, respectively. Such networks are often heuristically optimized with gradient descent-ascent (GDA), but it is unclear whether the optimization problem contains any saddle points, or whether heuristic methods can find them in practice. In this work, we analyze the training of Wasserstein GANs with two-layer neural network discriminators through the lens of convex duality, and for a variety of generators expose the conditions under which Wasserstein GANs can be solved exactly with convex optimization approaches, or can be represented as convex-concave games. Using this convex duality interpretation, we further demonstrate the impact of different activation functions of the discriminator. Our observations are verified with numerical results demonstrating the power of the convex interpretation, with applications in progressive training of convex architectures corresponding to linear generators and quadratic-activation discriminators for CelebA image generation. The code for our experiments is available at https://github.com/ardasahiner/ProCoGAN. △ Less

Submitted 21 March, 2022; v1 submitted 12 July, 2021; originally announced July 2021.

Comments: Published as paper in ICLR 2022. First two authors contributed equally to this work; 34 pages, 11 figures

arXiv:2105.07291 [pdf, other]

Adaptive Newton Sketch: Linear-time Optimization with Quadratic Convergence and Effective Hessian Dimensionality

Authors: Jonathan Lacotte, Yifei Wang, Mert Pilanci

Abstract: We propose a randomized algorithm with quadratic convergence rate for convex optimization problems with a self-concordant, composite, strongly convex objective function. Our method is based on performing an approximate Newton step using a random projection of the Hessian. Our first contribution is to show that, at each iteration, the embedding dimension (or sketch size) can be as small as the effe… ▽ More We propose a randomized algorithm with quadratic convergence rate for convex optimization problems with a self-concordant, composite, strongly convex objective function. Our method is based on performing an approximate Newton step using a random projection of the Hessian. Our first contribution is to show that, at each iteration, the embedding dimension (or sketch size) can be as small as the effective dimension of the Hessian matrix. Leveraging this novel fundamental result, we design an algorithm with a sketch size proportional to the effective dimension and which exhibits a quadratic rate of convergence. This result dramatically improves on the classical linear-quadratic convergence rates of state-of-the-art sub-sampled Newton methods. However, in most practical cases, the effective dimension is not known beforehand, and this raises the question of how to pick a sketch size as small as the effective dimension while preserving a quadratic convergence rate. Our second and main contribution is thus to propose an adaptive sketch size algorithm with quadratic convergence rate and which does not require prior knowledge or estimation of the effective dimension: at each iteration, it starts with a small sketch size, and increases it until quadratic progress is achieved. Importantly, we show that the embedding dimension remains proportional to the effective dimension throughout the entire path and that our method achieves state-of-the-art computational complexity for solving convex optimization programs with a strongly convex component. △ Less

Submitted 15 May, 2021; originally announced May 2021.

arXiv:2105.01420 [pdf, ps, other]

Training Quantized Neural Networks to Global Optimality via Semidefinite Programming

Authors: Burak Bartan, Mert Pilanci

Abstract: Neural networks (NNs) have been extremely successful across many tasks in machine learning. Quantization of NN weights has become an important topic due to its impact on their energy efficiency, inference time and deployment on hardware. Although post-training quantization is well-studied, training optimal quantized NNs involves combinatorial non-convex optimization problems which appear intractab… ▽ More Neural networks (NNs) have been extremely successful across many tasks in machine learning. Quantization of NN weights has become an important topic due to its impact on their energy efficiency, inference time and deployment on hardware. Although post-training quantization is well-studied, training optimal quantized NNs involves combinatorial non-convex optimization problems which appear intractable. In this work, we introduce a convex optimization strategy to train quantized NNs with polynomial activations. Our method leverages hidden convexity in two-layer neural networks from the recent literature, semidefinite lifting, and Grothendieck's identity. Surprisingly, we show that certain quantized NN problems can be solved to global optimality in polynomial-time in all relevant parameters via semidefinite relaxations. We present numerical examples to illustrate the effectiveness of our method. △ Less

Submitted 5 May, 2021; v1 submitted 4 May, 2021; originally announced May 2021.

Comments: v2: Minor edits in the text. The results are unchanged

arXiv:2104.14101 [pdf, other]

Fast Convex Quadratic Optimization Solvers with Adaptive Sketching-based Preconditioners

Authors: Jonathan Lacotte, Mert Pilanci

Abstract: We consider least-squares problems with quadratic regularization and propose novel sketching-based iterative methods with an adaptive sketch size. The sketch size can be as small as the effective dimension of the data matrix to guarantee linear convergence. However, a major difficulty in choosing the sketch size in terms of the effective dimension lies in the fact that the latter is usually unknow… ▽ More We consider least-squares problems with quadratic regularization and propose novel sketching-based iterative methods with an adaptive sketch size. The sketch size can be as small as the effective dimension of the data matrix to guarantee linear convergence. However, a major difficulty in choosing the sketch size in terms of the effective dimension lies in the fact that the latter is usually unknown in practice. Current sketching-based solvers for regularized least-squares fall short on addressing this issue. Our main contribution is to propose adaptive versions of standard sketching-based iterative solvers, namely, the iterative Hessian sketch and the preconditioned conjugate gradient method, that do not require a priori estimation of the effective dimension. We propose an adaptive mechanism to control the sketch size according to the progress made in each step of the iterative solver. If enough progress is not made, the sketch size increases to improve the convergence rate. We prove that the adaptive sketch size scales at most in terms of the effective dimension, and that our adaptive methods are guaranteed to converge linearly. Consequently, our adaptive methods improve the state-of-the-art complexity for solving dense, ill-conditioned least-squares problems. Importantly, we illustrate numerically on several synthetic and real datasets that our method is extremely efficient and is often significantly faster than standard least-squares solvers such as a direct factorization based solver, the conjugate gradient method and its preconditioned variants. △ Less

Submitted 29 April, 2021; originally announced April 2021.

arXiv:2103.07578 [pdf, other]

Efficient Randomized Subspace Embeddings for Distributed Optimization under a Communication Budget

Authors: Rajarshi Saha, Mert Pilanci, Andrea J. Goldsmith

Abstract: We study first-order optimization algorithms under the constraint that the descent direction is quantized using a pre-specified budget of $R$-bits per dimension, where $R \in (0 ,\infty)$. We propose computationally efficient optimization algorithms with convergence rates matching the information-theoretic performance lower bounds for: (i) Smooth and Strongly-Convex objectives with access to an Ex… ▽ More We study first-order optimization algorithms under the constraint that the descent direction is quantized using a pre-specified budget of $R$-bits per dimension, where $R \in (0 ,\infty)$. We propose computationally efficient optimization algorithms with convergence rates matching the information-theoretic performance lower bounds for: (i) Smooth and Strongly-Convex objectives with access to an Exact Gradient oracle, as well as (ii) General Convex and Non-Smooth objectives with access to a Noisy Subgradient oracle. The crux of these algorithms is a polynomial complexity source coding scheme that embeds a vector into a random subspace before quantizing it. These embeddings are such that with high probability, their projection along any of the canonical directions of the transform space is small. As a consequence, quantizing these embeddings followed by an inverse transform to the original space yields a source coding method with optimal covering efficiency while utilizing just $R$-bits per dimension. Our algorithms guarantee optimality for arbitrary values of the bit-budget $R$, which includes both the sub-linear budget regime ($R < 1$), as well as the high-budget regime ($R \geq 1$), while requiring $O\left(n^2\right)$ multiplications, where $n$ is the dimension. We also propose an efficient relaxation of this coding scheme using Hadamard subspaces that requires a near-linear time, i.e., $O\left(n \log n\right)$ additions.Furthermore, we show that the utility of our proposed embeddings can be extended to significantly improve the performance of gradient sparsification schemes. Numerical simulations validate our theoretical claims. Our implementations are available at https://github.com/rajarshisaha95/DistOptConstrComm. △ Less

Submitted 15 August, 2022; v1 submitted 12 March, 2021; originally announced March 2021.

Comments: 41 pages, 26 figures, 1 table. This work has been accepted for publication in the IEEE Journal on Selected Areas in Information Theory (JSAIT), Spl. issue on Distributed Coding and Computation

arXiv:2103.01499 [pdf, other]

Demystifying Batch Normalization in ReLU Networks: Equivalent Convex Optimization Models and Implicit Regularization

Authors: Tolga Ergen, Arda Sahiner, Batu Ozturkler, John Pauly, Morteza Mardani, Mert Pilanci

Abstract: Batch Normalization (BN) is a commonly used technique to accelerate and stabilize training of deep neural networks. Despite its empirical success, a full theoretical understanding of BN is yet to be developed. In this work, we analyze BN through the lens of convex optimization. We introduce an analytic framework based on convex duality to obtain exact convex representations of weight-decay regular… ▽ More Batch Normalization (BN) is a commonly used technique to accelerate and stabilize training of deep neural networks. Despite its empirical success, a full theoretical understanding of BN is yet to be developed. In this work, we analyze BN through the lens of convex optimization. We introduce an analytic framework based on convex duality to obtain exact convex representations of weight-decay regularized ReLU networks with BN, which can be trained in polynomial-time. Our analyses also show that optimal layer weights can be obtained as simple closed-form formulas in the high-dimensional and/or overparameterized regimes. Furthermore, we find that Gradient Descent provides an algorithmic bias effect on the standard non-convex BN network, and we design an approach to explicitly encode this implicit regularization into the convex objective. Experiments with CIFAR image classification highlight the effectiveness of this explicit regularization for mimicking and substantially improving the performance of standard BN networks. △ Less

Submitted 21 March, 2022; v1 submitted 2 March, 2021; originally announced March 2021.

Comments: Accepted to ICLR 2022. First two authors contributed equally to this work; 36 pages, 13 figures

arXiv:2102.03088 [pdf, other]

Boost AI Power: Data Augmentation Strategies with unlabelled Data and Conformal Prediction, a Case in Alternative Herbal Medicine Discrimination with Electronic Nose

Authors: Li Liu, Xianghao Zhan, Rumeng Wu, Xiaoqing Guan, Zhan Wang, Wei Zhang, Mert Pilanci, You Wang, Zhiyuan Luo, Guang Li

Abstract: Electronic nose has been proven to be effective in alternative herbal medicine classification, but due to the nature of supervised learning, previous research heavily relies on the labelled training data, which are time-costly and labor-intensive to collect. To alleviate the critical dependency on the training data in real-world applications, this study aims to improve classification accuracy via… ▽ More Electronic nose has been proven to be effective in alternative herbal medicine classification, but due to the nature of supervised learning, previous research heavily relies on the labelled training data, which are time-costly and labor-intensive to collect. To alleviate the critical dependency on the training data in real-world applications, this study aims to improve classification accuracy via data augmentation strategies. The effectiveness of five data augmentation strategies under different training data inadequacy are investigated in two scenarios: the noise-free scenario where different availabilities of unlabelled data were considered, and the noisy scenario where different levels of Gaussian noises and translational shifts were added to represent sensor drifts. The five augmentation strategies, namely noise-adding data augmentation, semi-supervised learning, classifier-based online learning, Inductive Conformal Prediction (ICP) online learning and our novel ensemble ICP online learning proposed in this study, are experimented and compared against supervised learning baseline, with Linear Discriminant Analysis (LDA) and Support Vector Machine (SVM) as the classifiers. Our novel strategy, ensemble ICP online learning, outperforms the others by showing non-decreasing classification accuracy on all tasks and a significant improvement on most simulated tasks (25out of 36 tasks,p<=0.05). Furthermore, this study provides a systematic analysis of different augmentation strategies. It shows at least one strategy significantly improved the classification accuracy with LDA (p<=0.05) and non-decreasing classification accuracy with SVM in each task. In particular, our proposed strategy demonstrated both effectiveness and robustness in boosting the classification model generalizability, which can be employed in other machine learning applications. △ Less

Submitted 17 July, 2021; v1 submitted 5 February, 2021; originally announced February 2021.

arXiv:2101.02429 [pdf, other]

Neural Spectrahedra and Semidefinite Lifts: Global Convex Optimization of Polynomial Activation Neural Networks in Fully Polynomial-Time

Authors: Burak Bartan, Mert Pilanci

Abstract: The training of two-layer neural networks with nonlinear activation functions is an important non-convex optimization problem with numerous applications and promising performance in layerwise deep learning. In this paper, we develop exact convex optimization formulations for two-layer neural networks with second degree polynomial activations based on semidefinite programming. Remarkably, we show t… ▽ More The training of two-layer neural networks with nonlinear activation functions is an important non-convex optimization problem with numerous applications and promising performance in layerwise deep learning. In this paper, we develop exact convex optimization formulations for two-layer neural networks with second degree polynomial activations based on semidefinite programming. Remarkably, we show that semidefinite lifting is always exact and therefore computational complexity for global optimization is polynomial in the input dimension and sample size for all input data. The developed convex formulations are proven to achieve the same global optimal solution set as their non-convex counterparts. More specifically, the globally optimal two-layer neural network with polynomial activations can be found by solving a semidefinite program (SDP) and decomposing the solution using a procedure we call Neural Decomposition. Moreover, the choice of regularizers plays a crucial role in the computational tractability of neural network training. We show that the standard weight decay regularization formulation is NP-hard, whereas other simple convex penalties render the problem tractable in polynomial time via convex programming. We extend the results beyond the fully connected architecture to different neural network architectures including networks with vector outputs and convolutional architectures with pooling. We provide extensive numerical simulations showing that the standard backpropagation approach often fails to achieve the global optimum of the training loss. The proposed approach is significantly faster to obtain better test accuracy compared to the standard backpropagation procedure. △ Less

Submitted 7 January, 2021; originally announced January 2021.

arXiv:2012.13329 [pdf, other]

Vector-output ReLU Neural Network Problems are Copositive Programs: Convex Analysis of Two Layer Networks and Polynomial-time Algorithms

Authors: Arda Sahiner, Tolga Ergen, John Pauly, Mert Pilanci

Abstract: We describe the convex semi-infinite dual of the two-layer vector-output ReLU neural network training problem. This semi-infinite dual admits a finite dimensional representation, but its support is over a convex set which is difficult to characterize. In particular, we demonstrate that the non-convex neural network training problem is equivalent to a finite-dimensional convex copositive program. O… ▽ More We describe the convex semi-infinite dual of the two-layer vector-output ReLU neural network training problem. This semi-infinite dual admits a finite dimensional representation, but its support is over a convex set which is difficult to characterize. In particular, we demonstrate that the non-convex neural network training problem is equivalent to a finite-dimensional convex copositive program. Our work is the first to identify this strong connection between the global optima of neural networks and those of copositive programs. We thus demonstrate how neural networks implicitly attempt to solve copositive programs via semi-nonnegative matrix factorization, and draw key insights from this formulation. We describe the first algorithms for provably finding the global minimum of the vector output neural network training problem, which are polynomial in the number of samples for a fixed data rank, yet exponential in the dimension. However, in the case of convolutional architectures, the computational complexity is exponential in only the filter size and polynomial in all other parameters. We describe the circumstances in which we can find the global optimum of this neural network training problem exactly with soft-thresholded SVD, and provide a copositive relaxation which is guaranteed to be exact for certain classes of problems, and which corresponds with the solution of Stochastic Gradient Descent in practice. △ Less

Submitted 20 December, 2021; v1 submitted 24 December, 2020; originally announced December 2020.

Comments: 25 pages, 6 figures

arXiv:2012.07054 [pdf, other]

Adaptive and Oblivious Randomized Subspace Methods for High-Dimensional Optimization: Sharp Analysis and Lower Bounds

Authors: Jonathan Lacotte, Mert Pilanci

Abstract: We propose novel randomized optimization methods for high-dimensional convex problems based on restrictions of variables to random subspaces. We consider oblivious and data-adaptive subspaces and study their approximation properties via convex duality and Fenchel conjugates. A suitable adaptive subspace can be generated by sampling a correlated random matrix whose second order statistics mirror th… ▽ More We propose novel randomized optimization methods for high-dimensional convex problems based on restrictions of variables to random subspaces. We consider oblivious and data-adaptive subspaces and study their approximation properties via convex duality and Fenchel conjugates. A suitable adaptive subspace can be generated by sampling a correlated random matrix whose second order statistics mirror the input data. We illustrate that the adaptive strategy can significantly outperform the standard oblivious sampling method, which is widely used in the recent literature. We show that the relative error of the randomized approximations can be tightly characterized in terms of the spectrum of the data matrix and Gaussian width of the dual tangent cone at optimum. We develop lower bounds for both optimization and statistical error measures based on concentration of measure and Fano's inequality. We then present the consequences of our theory with data matrices of varying spectral decay profiles. Experimental results show that the proposed approach enables significant speed ups in a wide variety of machine learning and optimization problems including logistic regression, kernel classification with random convolution layers and shallow neural networks with rectified linear units. △ Less

Submitted 13 December, 2020; originally announced December 2020.

arXiv:2012.05169 [pdf, other]

Convex Regularization Behind Neural Reconstruction

Authors: Arda Sahiner, Morteza Mardani, Batu Ozturkler, Mert Pilanci, John Pauly

Abstract: Neural networks have shown tremendous potential for reconstructing high-resolution images in inverse problems. The non-convex and opaque nature of neural networks, however, hinders their utility in sensitive applications such as medical imaging. To cope with this challenge, this paper advocates a convex duality framework that makes a two-layer fully-convolutional ReLU denoising network amenable to… ▽ More Neural networks have shown tremendous potential for reconstructing high-resolution images in inverse problems. The non-convex and opaque nature of neural networks, however, hinders their utility in sensitive applications such as medical imaging. To cope with this challenge, this paper advocates a convex duality framework that makes a two-layer fully-convolutional ReLU denoising network amenable to convex optimization. The convex dual network not only offers the optimum training with convex solvers, but also facilitates interpreting training and prediction. In particular, it implies training neural networks with weight decay regularization induces path sparsity while the prediction is piecewise linear filtering. A range of experiments with MNIST and fastMRI datasets confirm the efficacy of the dual network optimization problem. △ Less

Submitted 9 December, 2020; originally announced December 2020.

arXiv:2011.09709 [pdf, other]

Approximate Weighted $CR$ Coded Matrix Multiplication

Authors: Neophytos Charalambides, Mert Pilanci, Alfred Hero

Abstract: One of the most common, but at the same time expensive operations in linear algebra, is multiplying two matrices $A$ and $B$. With the rapid development of machine learning and increases in data volume, performing fast matrix intensive multiplications has become a major hurdle. Two different approaches to overcoming this issue are, 1) to approximate the product; and 2) to perform the multiplicatio… ▽ More One of the most common, but at the same time expensive operations in linear algebra, is multiplying two matrices $A$ and $B$. With the rapid development of machine learning and increases in data volume, performing fast matrix intensive multiplications has become a major hurdle. Two different approaches to overcoming this issue are, 1) to approximate the product; and 2) to perform the multiplication distributively. A \textit{$CR$-multiplication} is an approximation where columns and rows of $A$ and $B$ are respectively sampled with replacement. In the distributed setting, multiple workers perform matrix multiplication subtasks in parallel. Some of the workers may be stragglers, meaning they do not complete their task in time. We present a novel \textit{approximate weighted $CR$ coded matrix multiplication} scheme, that achieves improved performance for distributed matrix multiplication. △ Less

Submitted 19 November, 2020; originally announced November 2020.

Comments: 5 pages, 1 figure, conference

MSC Class: 65Fxx; 65Dxx ACM Class: E.4; H.1.1

arXiv:2010.13836 [pdf, other]

doi 10.1109/EMBC46164.2021.9630217

Linear Predictive Coding for Acute Stress Prediction from Computer Mouse Movements

Authors: Lawrence H. Kim, Rahul Goel, Jia Liang, Mert Pilanci, Pablo E. Paredes

Abstract: Prior work demonstrated the potential of using the Linear Predictive Coding (LPC) filter to approximate muscle stiffness and dam** from computer mouse movements to predict acute stress levels of users. Theoretically, muscle stiffness and dam** in the arm can be estimated using a mass-spring-damper (MSD) biomechanical model. However, the dam** frequency (i.e., stiffness) and dam** ratio val… ▽ More Prior work demonstrated the potential of using the Linear Predictive Coding (LPC) filter to approximate muscle stiffness and dam** from computer mouse movements to predict acute stress levels of users. Theoretically, muscle stiffness and dam** in the arm can be estimated using a mass-spring-damper (MSD) biomechanical model. However, the dam** frequency (i.e., stiffness) and dam** ratio values derived using LPC were not yet compared with those from a theoretical MSD model. This work demonstrates that the dam** frequency and dam** ratio from LPC are significantly correlated with those from an MSD model, thus confirming the validity of using LPC to infer muscle stiffness and dam**. We also compare the stress level binary classification performance using the values from LPC and MSD with each other and with neural network-based baselines. We found comparable performance across all conditions demonstrating LPC and MSD model-based stress prediction efficacy, especially for longer mouse trajectories. Clinical relevance: This work demonstrates the validity of the LPC filter to approximate muscle stiffness and dam** and predict acute stress from computer mouse movements. △ Less

Submitted 15 December, 2021; v1 submitted 26 October, 2020; originally announced October 2020.

Comments: The first three authors contributed equally. 5 pages, 6 figures, 2 tables, published at EMBC'21

Journal ref: 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC)

arXiv:2007.01327 [pdf, other]

Debiasing Distributed Second Order Optimization with Surrogate Sketching and Scaled Regularization

Authors: Michał Dereziński, Burak Bartan, Mert Pilanci, Michael W. Mahoney

Abstract: In distributed second order optimization, a standard strategy is to average many local estimates, each of which is based on a small sketch or batch of the data. However, the local estimates on each machine are typically biased, relative to the full solution on all of the data, and this can limit the effectiveness of averaging. Here, we introduce a new technique for debiasing the local estimates, w… ▽ More In distributed second order optimization, a standard strategy is to average many local estimates, each of which is based on a small sketch or batch of the data. However, the local estimates on each machine are typically biased, relative to the full solution on all of the data, and this can limit the effectiveness of averaging. Here, we introduce a new technique for debiasing the local estimates, which leads to both theoretical and empirical improvements in the convergence rate of distributed second order methods. Our technique has two novel components: (1) modifying standard sketching techniques to obtain what we call a surrogate sketch; and (2) carefully scaling the global regularization parameter for local computations. Our surrogate sketches are based on determinantal point processes, a family of distributions for which the bias of an estimate of the inverse Hessian can be computed exactly. Based on this computation, we show that when the objective being minimized is $l_2$-regularized with parameter $λ$ and individual machines are each given a sketch of size $m$, then to eliminate the bias, local estimates should be computed using a shrunk regularization parameter given by $λ^{\prime}=λ\cdot(1-\frac{d_λ}{m})$, where $d_λ$ is the $λ$-effective dimension of the Hessian (or, for quadratic problems, the data matrix). △ Less

Submitted 2 July, 2020; originally announced July 2020.

arXiv:2006.14798 [pdf, other]

Implicit Convex Regularizers of CNN Architectures: Convex Optimization of Two- and Three-Layer Networks in Polynomial Time

Authors: Tolga Ergen, Mert Pilanci

Abstract: We study training of Convolutional Neural Networks (CNNs) with ReLU activations and introduce exact convex optimization formulations with a polynomial complexity with respect to the number of data samples, the number of neurons, and data dimension. More specifically, we develop a convex analytic framework utilizing semi-infinite duality to obtain equivalent convex optimization problems for several… ▽ More We study training of Convolutional Neural Networks (CNNs) with ReLU activations and introduce exact convex optimization formulations with a polynomial complexity with respect to the number of data samples, the number of neurons, and data dimension. More specifically, we develop a convex analytic framework utilizing semi-infinite duality to obtain equivalent convex optimization problems for several two- and three-layer CNN architectures. We first prove that two-layer CNNs can be globally optimized via an $\ell_2$ norm regularized convex program. We then show that multi-layer circular CNN training problems with a single ReLU layer are equivalent to an $\ell_1$ regularized convex program that encourages sparsity in the spectral domain. We also extend these results to three-layer CNNs with two ReLU layers. Furthermore, we present extensions of our approach to different pooling methods, which elucidates the implicit architectural bias as convex regularizers. △ Less

Submitted 18 March, 2021; v1 submitted 26 June, 2020; originally announced June 2020.

Comments: Accepted for Spotlight Presentation at ICLR 2021

Journal ref: International Conference on Learning Representations (ICLR), 2021

arXiv:2006.08160 [pdf, other]

Lower Bounds and a Near-Optimal Shrinkage Estimator for Least Squares using Random Projections

Authors: Srivatsan Sridhar, Mert Pilanci, Ayfer Özgür

Abstract: In this work, we consider the deterministic optimization using random projections as a statistical estimation problem, where the squared distance between the predictions from the estimator and the true solution is the error metric. In approximately solving a large scale least squares problem using Gaussian sketches, we show that the sketched solution has a conditional Gaussian distribution with th… ▽ More In this work, we consider the deterministic optimization using random projections as a statistical estimation problem, where the squared distance between the predictions from the estimator and the true solution is the error metric. In approximately solving a large scale least squares problem using Gaussian sketches, we show that the sketched solution has a conditional Gaussian distribution with the true solution as its mean. Firstly, tight worst case error lower bounds with explicit constants are derived for any estimator using the Gaussian sketch, and the classical sketching is shown to be the optimal unbiased estimator. For biased estimators, the lower bound also incorporates prior knowledge about the true solution. Secondly, we use the James-Stein estimator to derive an improved estimator for the least squares solution using the Gaussian sketch. An upper bound on the expected error of this estimator is derived, which is smaller than the error of the classical Gaussian sketch solution for any given data. The upper and lower bounds match when the SNR of the true solution is known to be small and the data matrix is well conditioned. Empirically, this estimator achieves smaller error on simulated and real datasets, and works for other common sketching methods as well. △ Less

Submitted 15 June, 2020; originally announced June 2020.

Comments: This work has been submitted to the IEEE Journal on Selected Areas in Information Theory (JSAIT) - Special Issue on Estimation and Inference, and is awaiting review. This document contains 37 pages and 14 figures

arXiv:2006.05900 [pdf, other]

The Hidden Convex Optimization Landscape of Two-Layer ReLU Neural Networks: an Exact Characterization of the Optimal Solutions

Authors: Yifei Wang, Jonathan Lacotte, Mert Pilanci

Abstract: We prove that finding all globally optimal two-layer ReLU neural networks can be performed by solving a convex optimization program with cone constraints. Our analysis is novel, characterizes all optimal solutions, and does not leverage duality-based analysis which was recently used to lift neural network training into convex spaces. Given the set of solutions of our convex optimization program, w… ▽ More We prove that finding all globally optimal two-layer ReLU neural networks can be performed by solving a convex optimization program with cone constraints. Our analysis is novel, characterizes all optimal solutions, and does not leverage duality-based analysis which was recently used to lift neural network training into convex spaces. Given the set of solutions of our convex optimization program, we show how to construct exactly the entire set of optimal neural networks. We provide a detailed characterization of this optimal set and its invariant transformations. As additional consequences of our convex perspective, (i) we establish that Clarke stationary points found by stochastic gradient descent correspond to the global optimum of a subsampled convex problem (ii) we provide a polynomial-time algorithm for checking if a neural network is a global minimum of the training loss (iii) we provide an explicit construction of a continuous path between any neural network and the global minimum of its sublevel set and (iv) characterize the minimal size of the hidden layer so that the neural network optimization landscape has no spurious valleys. Overall, we provide a rich framework for studying the landscape of neural network training loss through convexity. △ Less

Submitted 13 March, 2022; v1 submitted 10 June, 2020; originally announced June 2020.

arXiv:2006.05874 [pdf, other]

Effective Dimension Adaptive Sketching Methods for Faster Regularized Least-Squares Optimization

Authors: Jonathan Lacotte, Mert Pilanci

Abstract: We propose a new randomized algorithm for solving L2-regularized least-squares problems based on sketching. We consider two of the most popular random embeddings, namely, Gaussian embeddings and the Subsampled Randomized Hadamard Transform (SRHT). While current randomized solvers for least-squares optimization prescribe an embedding dimension at least greater than the data dimension, we show that… ▽ More We propose a new randomized algorithm for solving L2-regularized least-squares problems based on sketching. We consider two of the most popular random embeddings, namely, Gaussian embeddings and the Subsampled Randomized Hadamard Transform (SRHT). While current randomized solvers for least-squares optimization prescribe an embedding dimension at least greater than the data dimension, we show that the embedding dimension can be reduced to the effective dimension of the optimization problem, and still preserve high-probability convergence guarantees. In this regard, we derive sharp matrix deviation inequalities over ellipsoids for both Gaussian and SRHT embeddings. Specifically, we improve on the constant of a classical Gaussian concentration bound whereas, for SRHT embeddings, our deviation inequality involves a novel technical approach. Leveraging these bounds, we are able to design a practical and adaptive algorithm which does not require to know the effective dimension beforehand. Our method starts with an initial embedding dimension equal to 1 and, over iterations, increases the embedding dimension up to the effective one at most. Hence, our algorithm improves the state-of-the-art computational complexity for solving regularized least-squares problems. Further, we show numerically that it outperforms standard iterative solvers such as the conjugate gradient method and its pre-conditioned version on several standard machine learning datasets. △ Less

Submitted 23 October, 2020; v1 submitted 10 June, 2020; originally announced June 2020.

arXiv:2005.10848 [pdf, other]

doi 10.1109/JSAIT.2020.3041804

Global Multiclass Classification and Dataset Construction via Heterogeneous Local Experts

Authors: Surin Ahn, Ayfer Ozgur, Mert Pilanci

Abstract: In the domains of dataset construction and crowdsourcing, a notable challenge is to aggregate labels from a heterogeneous set of labelers, each of whom is potentially an expert in some subset of tasks (and less reliable in others). To reduce costs of hiring human labelers or training automated labeling systems, it is of interest to minimize the number of labelers while ensuring the reliability of… ▽ More In the domains of dataset construction and crowdsourcing, a notable challenge is to aggregate labels from a heterogeneous set of labelers, each of whom is potentially an expert in some subset of tasks (and less reliable in others). To reduce costs of hiring human labelers or training automated labeling systems, it is of interest to minimize the number of labelers while ensuring the reliability of the resulting dataset. We model this as the problem of performing $K$-class classification using the predictions of smaller classifiers, each trained on a subset of $[K]$, and derive bounds on the number of classifiers needed to accurately infer the true class of an unlabeled sample under both adversarial and stochastic assumptions. By exploiting a connection to the classical set cover problem, we produce a near-optimal scheme for designing such configurations of classifiers which recovers the well known one-vs.-one classification approach as a special case. Experiments with the MNIST and CIFAR-10 datasets demonstrate the favorable accuracy (compared to a centralized classifier) of our aggregation scheme applied to classifiers trained on subsets of the data. These results suggest a new way to automatically label data or adapt an existing set of local classifiers to larger-scale multiclass problems. △ Less

Submitted 5 January, 2021; v1 submitted 21 May, 2020; originally announced May 2020.

Comments: 27 pages, 8 figures, to be published in IEEE Journal on Selected Areas in Information Theory (JSAIT) - Special Issue on Estimation and Inference

arXiv:2003.02948 [pdf, other]

Straggler Robust Distributed Matrix Inverse Approximation

Authors: Neophytos Charalambides, Mert Pilanci, Alfred O. Hero III

Abstract: A cumbersome operation in numerical analysis and linear algebra, optimization, machine learning and engineering algorithms; is inverting large full-rank matrices which appears in various processes and applications. This has both numerical stability and complexity issues, as well as high expected time to compute. We address the latter issue, by proposing an algorithm which uses a black-box least sq… ▽ More A cumbersome operation in numerical analysis and linear algebra, optimization, machine learning and engineering algorithms; is inverting large full-rank matrices which appears in various processes and applications. This has both numerical stability and complexity issues, as well as high expected time to compute. We address the latter issue, by proposing an algorithm which uses a black-box least squares optimization solver as a subroutine, to give an estimate of the inverse (and pseudoinverse) of real nonsingular matrices; by estimating its columns. This also gives it the flexibility to be performed in a distributed manner, thus the estimate can be obtained a lot faster, and can be made robust to \textit{stragglers}. Furthermore, we assume a centralized network with no message passing between the computing nodes, and do not require a matrix factorization; e.g. LU, SVD or QR decomposition beforehand. △ Less

Submitted 22 June, 2022; v1 submitted 5 March, 2020; originally announced March 2020.

Comments: 4 pages, 1 figure, conference

MSC Class: 65F05 (Primary); 94B60 (Secondary)

arXiv:2002.11219 [pdf, ps, other]

Convex Geometry and Duality of Over-parameterized Neural Networks

Authors: Tolga Ergen, Mert Pilanci

Abstract: We develop a convex analytic approach to analyze finite width two-layer ReLU networks. We first prove that an optimal solution to the regularized training problem can be characterized as extreme points of a convex set, where simple solutions are encouraged via its convex geometrical properties. We then leverage this characterization to show that an optimal set of parameters yield linear spline int… ▽ More We develop a convex analytic approach to analyze finite width two-layer ReLU networks. We first prove that an optimal solution to the regularized training problem can be characterized as extreme points of a convex set, where simple solutions are encouraged via its convex geometrical properties. We then leverage this characterization to show that an optimal set of parameters yield linear spline interpolation for regression problems involving one dimensional or rank-one data. We also characterize the classification decision regions in terms of a kernel matrix and minimum $\ell_1$-norm solutions. This is in contrast to Neural Tangent Kernel which is unable to explain predictions of finite width networks. Our convex geometric characterization also provides intuitive explanations of hidden neurons as auto-encoders. In higher dimensions, we show that the training problem can be cast as a finite dimensional convex problem with infinitely many constraints. Then, we apply certain convex relaxations and introduce a cutting-plane algorithm to globally optimize the network. We further analyze the exactness of the relaxations to provide conditions for the convergence to a global optimum. Our analysis also shows that optimal network parameters can be also characterized as interpretable closed-form formulas in some practically relevant special cases. △ Less

Submitted 30 August, 2021; v1 submitted 25 February, 2020; originally announced February 2020.

Comments: Accepted to the Journal of Machine Learning Research (JMLR)

arXiv:2002.10674 [pdf, other]

Separating the Effects of Batch Normalization on CNN Training Speed and Stability Using Classical Adaptive Filter Theory

Authors: Elaina Chai, Mert Pilanci, Boris Murmann

Abstract: Batch Normalization (BatchNorm) is commonly used in Convolutional Neural Networks (CNNs) to improve training speed and stability. However, there is still limited consensus on why this technique is effective. This paper uses concepts from the traditional adaptive filter domain to provide insight into the dynamics and inner workings of BatchNorm. First, we show that the convolution weight updates ha… ▽ More Batch Normalization (BatchNorm) is commonly used in Convolutional Neural Networks (CNNs) to improve training speed and stability. However, there is still limited consensus on why this technique is effective. This paper uses concepts from the traditional adaptive filter domain to provide insight into the dynamics and inner workings of BatchNorm. First, we show that the convolution weight updates have natural modes whose stability and convergence speed are tied to the eigenvalues of the input autocorrelation matrices, which are controlled by BatchNorm through the convolution layers' channel-wise structure. Furthermore, our experiments demonstrate that the speed and stability benefits are distinct effects. At low learning rates, it is BatchNorm's amplification of the smallest eigenvalues that improves convergence speed, while at high learning rates, it is BatchNorm's suppression of the largest eigenvalues that ensures stability. Lastly, we prove that in the first training step, when normalization is needed most, BatchNorm satisfies the same optimization as Normalized Least Mean Square (NLMS), while it continues to approximate this condition in subsequent steps. The analyses provided in this paper lay the groundwork for gaining further insight into the operation of modern neural network structures using adaptive filter theory. △ Less

Submitted 1 June, 2021; v1 submitted 25 February, 2020; originally announced February 2020.

Comments: Presented at Asilomar Conference on Signals, Systems, and Computers, 2020

arXiv:2002.10553 [pdf, other]

Neural Networks are Convex Regularizers: Exact Polynomial-time Convex Optimization Formulations for Two-layer Networks

Authors: Mert Pilanci, Tolga Ergen

Abstract: We develop exact representations of training two-layer neural networks with rectified linear units (ReLUs) in terms of a single convex program with number of variables polynomial in the number of training samples and the number of hidden neurons. Our theory utilizes semi-infinite duality and minimum norm regularization. We show that ReLU networks trained with standard weight decay are equivalent t… ▽ More We develop exact representations of training two-layer neural networks with rectified linear units (ReLUs) in terms of a single convex program with number of variables polynomial in the number of training samples and the number of hidden neurons. Our theory utilizes semi-infinite duality and minimum norm regularization. We show that ReLU networks trained with standard weight decay are equivalent to block $\ell_1$ penalized convex models. Moreover, we show that certain standard convolutional linear networks are equivalent semi-definite programs which can be simplified to $\ell_1$ regularized linear models in a polynomial sized discrete Fourier feature space. △ Less

Submitted 15 August, 2020; v1 submitted 24 February, 2020; originally announced February 2020.

arXiv:2002.09773 [pdf, other]

Revealing the Structure of Deep Neural Networks via Convex Duality

Authors: Tolga Ergen, Mert Pilanci

Abstract: We study regularized deep neural networks (DNNs) and introduce a convex analytic framework to characterize the structure of the hidden layers. We show that a set of optimal hidden layer weights for a norm regularized DNN training problem can be explicitly found as the extreme points of a convex set. For the special case of deep linear networks, we prove that each optimal weight matrix aligns with… ▽ More We study regularized deep neural networks (DNNs) and introduce a convex analytic framework to characterize the structure of the hidden layers. We show that a set of optimal hidden layer weights for a norm regularized DNN training problem can be explicitly found as the extreme points of a convex set. For the special case of deep linear networks, we prove that each optimal weight matrix aligns with the previous layers via duality. More importantly, we apply the same characterization to deep ReLU networks with whitened data and prove the same weight alignment holds. As a corollary, we also prove that norm regularized deep ReLU networks yield spline interpolation for one-dimensional datasets which was previously known only for two-layer networks. Furthermore, we provide closed-form solutions for the optimal layer weights when data is rank-one or whitened. The same analysis also applies to architectures with batch normalization even for arbitrary data. Therefore, we obtain a complete explanation for a recent empirical observation termed Neural Collapse where class means collapse to the vertices of a simplex equiangular tight frame. △ Less

Submitted 11 June, 2021; v1 submitted 22 February, 2020; originally announced February 2020.

Comments: Accepted to ICML 2021

arXiv:2002.09488 [pdf, other]

Optimal Randomized First-Order Methods for Least-Squares Problems

Authors: Jonathan Lacotte, Mert Pilanci

Abstract: We provide an exact analysis of a class of randomized algorithms for solving overdetermined least-squares problems. We consider first-order methods, where the gradients are pre-conditioned by an approximation of the Hessian, based on a subspace embedding of the data matrix. This class of algorithms encompasses several randomized methods among the fastest solvers for least-squares problems. We focu… ▽ More We provide an exact analysis of a class of randomized algorithms for solving overdetermined least-squares problems. We consider first-order methods, where the gradients are pre-conditioned by an approximation of the Hessian, based on a subspace embedding of the data matrix. This class of algorithms encompasses several randomized methods among the fastest solvers for least-squares problems. We focus on two classical embeddings, namely, Gaussian projections and subsampled randomized Hadamard transforms (SRHT). Our key technical innovation is the derivation of the limiting spectral density of SRHT embeddings. Leveraging this novel result, we derive the family of normalized orthogonal polynomials of the SRHT density and we find the optimal pre-conditioned first-order method along with its rate of convergence. Our analysis of Gaussian embeddings proceeds similarly, and leverages classical random matrix theory results. In particular, we show that for a given sketch size, SRHT embeddings exhibits a faster rate of convergence than Gaussian embeddings. Then, we propose a new algorithm by optimizing the computational complexity over the choice of the sketching dimension. To our knowledge, our resulting algorithm yields the best known complexity for solving least-squares problems with no condition number dependence. △ Less

Submitted 25 February, 2020; v1 submitted 21 February, 2020; originally announced February 2020.

Comments: arXiv admin note: text overlap with arXiv:2002.00864

arXiv:2002.06540 [pdf, other]

Distributed Averaging Methods for Randomized Second Order Optimization

Authors: Burak Bartan, Mert Pilanci

Abstract: We consider distributed optimization problems where forming the Hessian is computationally challenging and communication is a significant bottleneck. We develop unbiased parameter averaging methods for randomized second order optimization that employ sampling and sketching of the Hessian. Existing works do not take the bias of the estimators into consideration, which limits their application to ma… ▽ More We consider distributed optimization problems where forming the Hessian is computationally challenging and communication is a significant bottleneck. We develop unbiased parameter averaging methods for randomized second order optimization that employ sampling and sketching of the Hessian. Existing works do not take the bias of the estimators into consideration, which limits their application to massively parallel computation. We provide closed-form formulas for regularization parameters and step sizes that provably minimize the bias for sketched Newton directions. We also extend the framework of second order averaging methods to introduce an unbiased distributed optimization framework for heterogeneous computing systems with varying worker resources. Additionally, we demonstrate the implications of our theoretical findings via large scale experiments performed on a serverless computing platform. △ Less

Submitted 16 February, 2020; originally announced February 2020.

arXiv:2002.06538 [pdf, other]

Distributed Sketching Methods for Privacy Preserving Regression

Authors: Burak Bartan, Mert Pilanci

Abstract: In this work, we study distributed sketching methods for large scale regression problems. We leverage multiple randomized sketches for reducing the problem dimensions as well as preserving privacy and improving straggler resilience in asynchronous distributed systems. We derive novel approximation guarantees for classical sketching methods and analyze the accuracy of parameter averaging for distri… ▽ More In this work, we study distributed sketching methods for large scale regression problems. We leverage multiple randomized sketches for reducing the problem dimensions as well as preserving privacy and improving straggler resilience in asynchronous distributed systems. We derive novel approximation guarantees for classical sketching methods and analyze the accuracy of parameter averaging for distributed sketches. We consider random matrices including Gaussian, randomized Hadamard, uniform sampling and leverage score sampling in the distributed setting. Moreover, we propose a hybrid approach combining sampling and fast random projections for better computational efficiency. We illustrate the performance of distributed sketches in a serverless computing platform with large scale experiments. △ Less

Submitted 19 June, 2020; v1 submitted 16 February, 2020; originally announced February 2020.

arXiv:2002.02291 [pdf, other]

Weighted Gradient Coding with Leverage Score Sampling

Authors: Neophytos Charalambides, Mert Pilanci, Alfred O. Hero III

Abstract: A major hurdle in machine learning is scalability to massive datasets. Approaches to overcome this hurdle include compression of the data matrix and distributing the computations. \textit{Leverage score sampling} provides a compressed approximation of a data matrix using an importance weighted subset. \textit{Gradient coding} has been recently proposed in distributed optimization to compute the gr… ▽ More A major hurdle in machine learning is scalability to massive datasets. Approaches to overcome this hurdle include compression of the data matrix and distributing the computations. \textit{Leverage score sampling} provides a compressed approximation of a data matrix using an importance weighted subset. \textit{Gradient coding} has been recently proposed in distributed optimization to compute the gradient using multiple unreliable worker nodes. By designing coding matrices, gradient coded computations can be made resilient to stragglers, which are nodes in a distributed network that degrade system performance. We present a novel \textit{weighted leverage score} approach, that achieves improved performance for distributed gradient coding by utilizing an importance sampling. △ Less

Submitted 15 September, 2020; v1 submitted 30 January, 2020; originally announced February 2020.

Comments: 4 pages, 2 figures, 2 tables, conference

MSC Class: 94Bxx; 94B60

arXiv:2002.02208 [pdf, ps, other]

Global Convergence of Frank Wolfe on One Hidden Layer Networks

Authors: Alexandre d'Aspremont, Mert Pilanci

Abstract: We derive global convergence bounds for the Frank Wolfe algorithm when training one hidden layer neural networks. When using the ReLU activation function, and under tractable preconditioning assumptions on the sample data set, the linear minimization oracle used to incrementally form the solution can be solved explicitly as a second order cone program. The classical Frank Wolfe algorithm then conv… ▽ More We derive global convergence bounds for the Frank Wolfe algorithm when training one hidden layer neural networks. When using the ReLU activation function, and under tractable preconditioning assumptions on the sample data set, the linear minimization oracle used to incrementally form the solution can be solved explicitly as a second order cone program. The classical Frank Wolfe algorithm then converges with rate $O(1/T)$ where $T$ is both the number of neurons and the number of calls to the oracle. △ Less

Submitted 6 February, 2020; originally announced February 2020.

arXiv:2002.00864 [pdf, other]

Optimal Iterative Sketching with the Subsampled Randomized Hadamard Transform

Authors: Jonathan Lacotte, Sifan Liu, Edgar Dobriban, Mert Pilanci

Abstract: Random projections or sketching are widely used in many algorithmic and learning contexts. Here we study the performance of iterative Hessian sketch for least-squares problems. By leveraging and extending recent results from random matrix theory on the limiting spectrum of matrices randomly projected with the subsampled randomized Hadamard transform, and truncated Haar matrices, we can study and c… ▽ More Random projections or sketching are widely used in many algorithmic and learning contexts. Here we study the performance of iterative Hessian sketch for least-squares problems. By leveraging and extending recent results from random matrix theory on the limiting spectrum of matrices randomly projected with the subsampled randomized Hadamard transform, and truncated Haar matrices, we can study and compare the resulting algorithms to a level of precision that has not been possible before. Our technical contributions include a novel formula for the second moment of the inverse of projected matrices. We also find simple closed-form expressions for asymptotically optimal step-sizes and convergence rates. These show that the convergence rate for Haar and randomized Hadamard matrices are identical, and asymptotically improve upon Gaussian random projections. These techniques may be applied to other algorithms that employ randomized dimension reduction. △ Less

Submitted 23 October, 2020; v1 submitted 3 February, 2020; originally announced February 2020.

arXiv:1912.03514 [pdf, ps, other]

M-IHS: An Accelerated Randomized Preconditioning Method Avoiding Costly Matrix Decompositions

Authors: Ibrahim Kurban Ozaslan, Mert Pilanci, Orhan Arikan

Abstract: Momentum Iterative Hessian Sketch (M-IHS) techniques, a group of solvers for large scale regularized linear Least Squares (LS) problems, are proposed and analyzed in detail. Proposed M-IHS techniques are obtained by incorporating the Heavy Ball Acceleration into the Iterative Hessian Sketch algorithm and they provide significant improvements over the randomized preconditioning techniques. By using… ▽ More Momentum Iterative Hessian Sketch (M-IHS) techniques, a group of solvers for large scale regularized linear Least Squares (LS) problems, are proposed and analyzed in detail. Proposed M-IHS techniques are obtained by incorporating the Heavy Ball Acceleration into the Iterative Hessian Sketch algorithm and they provide significant improvements over the randomized preconditioning techniques. By using approximate solvers along with the iterations, the proposed techniques are capable of avoiding all matrix decompositions and inversions, which is one of the main advantages over the alternative solvers such as the Blendenpik and the LSRN. Similar to the Chebyshev semi-iterations, the M-IHS variants do not use any inner products and eliminate the corresponding synchronization steps in hierarchical or distributed memory systems, yet the M-IHS converges faster than the Chebyshev Semi-iteration based solvers. Lower bounds on the required sketch size for various randomized distributions are established through the error analyses. Unlike the previously proposed approaches to produce a solution approximation, the proposed M-IHS techniques can use sketch sizes that are proportional to the statistical dimension which is always smaller than the rank of the coefficient matrix. The relative computational saving gets more significant as the regularization parameter or the singular value decay rate of the coefficient matrix increase. △ Less

Submitted 28 November, 2020; v1 submitted 7 December, 2019; originally announced December 2019.

MSC Class: 15B52; 65F08; 65F10; 65F22; 65F50; 68W20; 90C06 ACM Class: G.1.3; G.1.6

arXiv:1911.02675 [pdf, other]

Faster Least Squares Optimization

Authors: Jonathan Lacotte, Mert Pilanci

Abstract: We investigate iterative methods with randomized preconditioners for solving overdetermined least-squares problems, where the preconditioners are based on a random embedding of the data matrix. We consider two distinct approaches: the sketch is either computed once (fixed preconditioner), or, the random projection is refreshed at each iteration, i.e., sampled independently of previous ones (varyin… ▽ More We investigate iterative methods with randomized preconditioners for solving overdetermined least-squares problems, where the preconditioners are based on a random embedding of the data matrix. We consider two distinct approaches: the sketch is either computed once (fixed preconditioner), or, the random projection is refreshed at each iteration, i.e., sampled independently of previous ones (varying preconditioners). Although fixed sketching-based preconditioners have received considerable attention in the recent literature, little is known about the performance of refreshed sketches. For a fixed sketch, we characterize the optimal iterative method, that is, the preconditioned conjugate gradient as well as its rate of convergence in terms of the subspace embedding properties of the random embedding. For refreshed sketches, we provide a closed-form formula for the expected error of the iterative Hessian sketch (IHS), a.k.a. preconditioned steepest descent. In contrast to the guarantees and analysis for fixed preconditioners based on subspace embedding properties, our formula is exact and it involves the expected inverse moments of the random projection. Our main technical contribution is to show that this convergence rate is, surprisingly, unimprovable with heavy-ball momentum. Additionally, we construct the locally optimal first-order method whose convergence rate is bounded by that of the IHS. Based on these theoretical and numerical investigations, we do not observe that the additional randomness of refreshed sketches provides a clear advantage over a fixed preconditioner. Therefore, we prescribe PCG as the method of choice along with an optimized sketch size according to our analysis. Our prescribed sketch size yields state-of-the-art computational complexity for solving highly overdetermined linear systems. Lastly, we illustrate the numerical benefits of our algorithms. △ Less

Submitted 13 April, 2021; v1 submitted 6 November, 2019; originally announced November 2019.

arXiv:1907.05984 [pdf, other]

Distributed Black-Box Optimization via Error Correcting Codes

Authors: Burak Bartan, Mert Pilanci

Abstract: We introduce a novel distributed derivative-free optimization framework that is resilient to stragglers. The proposed method employs coded search directions at which the objective function is evaluated, and a decoding step to find the next iterate. Our framework can be seen as an extension of evolution strategies and structured exploration methods where structured search directions were utilized.… ▽ More We introduce a novel distributed derivative-free optimization framework that is resilient to stragglers. The proposed method employs coded search directions at which the objective function is evaluated, and a decoding step to find the next iterate. Our framework can be seen as an extension of evolution strategies and structured exploration methods where structured search directions were utilized. As an application, we consider black-box adversarial attacks on deep convolutional neural networks. Our numerical experiments demonstrate a significant improvement in the computation times. △ Less

Submitted 12 July, 2019; originally announced July 2019.

arXiv:1906.11809 [pdf, other]

High-Dimensional Optimization in Adaptive Random Subspaces

Authors: Jonathan Lacotte, Mert Pilanci, Marco Pavone

Abstract: We propose a new randomized optimization method for high-dimensional problems which can be seen as a generalization of coordinate descent to random subspaces. We show that an adaptive sampling strategy for the random subspace significantly outperforms the oblivious sampling method, which is the common choice in the recent literature. The adaptive subspace can be efficiently generated by a correlat… ▽ More We propose a new randomized optimization method for high-dimensional problems which can be seen as a generalization of coordinate descent to random subspaces. We show that an adaptive sampling strategy for the random subspace significantly outperforms the oblivious sampling method, which is the common choice in the recent literature. The adaptive subspace can be efficiently generated by a correlated random matrix ensemble whose statistics mimic the input data. We prove that the improvement in the relative error of the solution can be tightly characterized in terms of the spectrum of the data matrix, and provide probabilistic upper-bounds. We then illustrate the consequences of our theory with data matrices of different spectral decay. Extensive experimental results show that the proposed approach offers significant speed ups in machine learning problems including logistic regression, kernel classification with random convolution layers and shallow neural networks with rectified linear units. Our analysis is based on convex analysis and Fenchel duality, and establishes connections to sketching and randomized matrix decomposition. △ Less

Submitted 18 December, 2019; v1 submitted 27 June, 2019; originally announced June 2019.

arXiv:1901.06811 [pdf, other]

Straggler Resilient Serverless Computing Based on Polar Codes

Authors: Burak Bartan, Mert Pilanci

Abstract: We propose a serverless computing mechanism for distributed computation based on polar codes. Serverless computing is an emerging cloud based computation model that lets users run their functions on the cloud without provisioning or managing servers. Our proposed approach is a hybrid computing framework that carries out computationally expensive tasks such as linear algebraic operations involving… ▽ More We propose a serverless computing mechanism for distributed computation based on polar codes. Serverless computing is an emerging cloud based computation model that lets users run their functions on the cloud without provisioning or managing servers. Our proposed approach is a hybrid computing framework that carries out computationally expensive tasks such as linear algebraic operations involving large-scale data using serverless computing and does the rest of the processing locally. We address the limitations and reliability issues of serverless platforms such as straggling workers using coding theory, drawing ideas from recent literature on coded computation. The proposed mechanism uses polar codes to ensure straggler-resilience in a computationally effective manner. We provide extensive evidence showing polar codes outperform other coding methods. We have designed a sequential decoder specifically for polar codes in erasure channels with full-precision input and outputs. In addition, we have extended the proposed method to the matrix multiplication case where both matrices being multiplied are coded. The proposed coded computation scheme is implemented for AWS Lambda. Experiment results are presented where the performance of the proposed coded computation technique is tested in optimization via gradient descent. Finally, we introduce the idea of partial polarization which reduces the computational burden of encoding and decoding at the expense of straggler-resilience. △ Less

Submitted 12 July, 2019; v1 submitted 21 January, 2019; originally announced January 2019.

Comments: New results added in the new version. More discussion on serverless computing

arXiv:1901.00035 [pdf, other]

Convex Relaxations of Convolutional Neural Nets

Authors: Burak Bartan, Mert Pilanci

Abstract: We propose convex relaxations for convolutional neural nets with one hidden layer where the output weights are fixed. For convex activation functions such as rectified linear units, the relaxations are convex second order cone programs which can be solved very efficiently. We prove that the relaxation recovers the global minimum under a planted model assumption, given sufficiently many training sa… ▽ More We propose convex relaxations for convolutional neural nets with one hidden layer where the output weights are fixed. For convex activation functions such as rectified linear units, the relaxations are convex second order cone programs which can be solved very efficiently. We prove that the relaxation recovers the global minimum under a planted model assumption, given sufficiently many training samples from a Gaussian distribution. We also identify a phase transition phenomenon in recovering the global minimum for the relaxation. △ Less

Submitted 31 December, 2018; originally announced January 2019.

arXiv:1803.05288 [pdf, other]

Domain Adaptation on Graphs by Learning Aligned Graph Bases

Authors: Mehmet Pilanci, Elif Vural

Abstract: A common assumption in semi-supervised learning with graph models is that the class label function varies smoothly on the data graph, resulting in the rather strict prior that the label function has low-frequency content. Meanwhile, in many classification problems, the label function may vary abruptly in certain graph regions, resulting in high-frequency components. Although the semi-supervised es… ▽ More A common assumption in semi-supervised learning with graph models is that the class label function varies smoothly on the data graph, resulting in the rather strict prior that the label function has low-frequency content. Meanwhile, in many classification problems, the label function may vary abruptly in certain graph regions, resulting in high-frequency components. Although the semi-supervised estimation of class labels is an ill-posed problem in general, in several applications it is possible to find a source graph on which the label function has similar frequency content to that on the target graph where the actual classification problem is defined. In this paper, we propose a method for domain adaptation on graphs motivated by these observations. Our algorithm is based on learning the spectrum of the label function in a source graph with many labeled nodes, and transferring the information of the spectrum to the target graph with fewer labeled nodes. While the frequency content of the class label function can be identified through the graph Fourier transform, it is not easy to transfer the Fourier coefficients directly between the two graphs, since no one-to-one match exists between the Fourier basis vectors of independently constructed graphs in the domain adaptation setting. We solve this problem by learning a transformation between the Fourier bases of the two graphs that flexibly ``aligns'' them. The unknown class label function on the target graph is then reconstructed such that its spectrum matches that on the source graph while also ensuring the consistency with the available labels. The proposed method is tested in the classification of image, online product review, and social network data sets. Comparative experiments suggest that the proposed algorithm performs better than recent domain adaptation methods in the literature in most settings. △ Less

Submitted 4 February, 2020; v1 submitted 14 March, 2018; originally announced March 2018.

arXiv:1505.02250 [pdf, other]

Newton Sketch: A Linear-time Optimization Algorithm with Linear-Quadratic Convergence

Authors: Mert Pilanci, Martin J. Wainwright

Abstract: We propose a randomized second-order method for optimization known as the Newton Sketch: it is based on performing an approximate Newton step using a randomly projected or sub-sampled Hessian. For self-concordant functions, we prove that the algorithm has super-linear convergence with exponentially high probability, with convergence and complexity guarantees that are independent of condition numbe… ▽ More We propose a randomized second-order method for optimization known as the Newton Sketch: it is based on performing an approximate Newton step using a randomly projected or sub-sampled Hessian. For self-concordant functions, we prove that the algorithm has super-linear convergence with exponentially high probability, with convergence and complexity guarantees that are independent of condition numbers and related problem-dependent quantities. Given a suitable initialization, similar guarantees also hold for strongly convex and smooth objectives without self-concordance. When implemented using randomized projections based on a sub-sampled Hadamard basis, the algorithm typically has substantially lower complexity than Newton's method. We also describe extensions of our methods to programs involving convex constraints that are equipped with self-concordant barriers. We discuss and illustrate applications to linear programs, quadratic programs with convex constraints, logistic regression and other generalized linear models, as well as semidefinite programs. △ Less

Submitted 9 May, 2015; originally announced May 2015.

arXiv:1501.06195 [pdf, ps, other]

Randomized sketches for kernels: Fast and optimal non-parametric regression

Authors: Yun Yang, Mert Pilanci, Martin J. Wainwright

Abstract: Kernel ridge regression (KRR) is a standard method for performing non-parametric regression over reproducing kernel Hilbert spaces. Given $n$ samples, the time and space complexity of computing the KRR estimate scale as $\mathcal{O}(n^3)$ and $\mathcal{O}(n^2)$ respectively, and so is prohibitive in many cases. We propose approximations of KRR based on $m$-dimensional randomized sketches of the ke… ▽ More Kernel ridge regression (KRR) is a standard method for performing non-parametric regression over reproducing kernel Hilbert spaces. Given $n$ samples, the time and space complexity of computing the KRR estimate scale as $\mathcal{O}(n^3)$ and $\mathcal{O}(n^2)$ respectively, and so is prohibitive in many cases. We propose approximations of KRR based on $m$-dimensional randomized sketches of the kernel matrix, and study how small the projection dimension $m$ can be chosen while still preserving minimax optimality of the approximate KRR estimate. For various classes of randomized sketches, including those based on Gaussian and randomized Hadamard matrices, we prove that it suffices to choose the sketch dimension $m$ proportional to the statistical dimension (modulo logarithmic factors). Thus, we obtain fast and minimax optimal approximations to the KRR estimate for non-parametric regression. △ Less

Submitted 25 January, 2015; originally announced January 2015.

Comments: 27 pages, 3 figures

arXiv:1411.0347 [pdf, other]

Iterative Hessian sketch: Fast and accurate solution approximation for constrained least-squares

Authors: Mert Pilanci, Martin J. Wainwright

Abstract: We study randomized sketching methods for approximately solving least-squares problem with a general convex constraint. The quality of a least-squares approximation can be assessed in different ways: either in terms of the value of the quadratic objective function (cost approximation), or in terms of some distance measure between the approximate minimizer and the true minimizer (solution approxima… ▽ More We study randomized sketching methods for approximately solving least-squares problem with a general convex constraint. The quality of a least-squares approximation can be assessed in different ways: either in terms of the value of the quadratic objective function (cost approximation), or in terms of some distance measure between the approximate minimizer and the true minimizer (solution approximation). Focusing on the latter criterion, our first main result provides a general lower bound on any randomized method that sketches both the data matrix and vector in a least-squares problem; as a surprising consequence, the most widely used least-squares sketch is sub-optimal for solution approximation. We then present a new method known as the iterative Hessian sketch, and show that it can be used to obtain approximations to the original least-squares problem using a projection dimension proportional to the statistical complexity of the least-squares minimizer, and a logarithmic number of iterations. We illustrate our general theory with simulations for both unconstrained and constrained versions of least-squares, including $\ell_1$-regularization and nuclear norm constraints. We also numerically demonstrate the practicality of our approach in a real face expression classification experiment. △ Less

Submitted 2 November, 2014; originally announced November 2014.

arXiv:1404.7203 [pdf, ps, other]

Randomized Sketches of Convex Programs with Sharp Guarantees

Authors: Mert Pilanci, Martin J. Wainwright

Abstract: Random projection (RP) is a classical technique for reducing storage and computational costs. We analyze RP-based approximations of convex programs, in which the original optimization problem is approximated by the solution of a lower-dimensional problem. Such dimensionality reduction is essential in computation-limited settings, since the complexity of general convex programming can be quite high… ▽ More Random projection (RP) is a classical technique for reducing storage and computational costs. We analyze RP-based approximations of convex programs, in which the original optimization problem is approximated by the solution of a lower-dimensional problem. Such dimensionality reduction is essential in computation-limited settings, since the complexity of general convex programming can be quite high (e.g., cubic for quadratic programs, and substantially higher for semidefinite programs). In addition to computational savings, random projection is also useful for reducing memory usage, and has useful properties for privacy-sensitive optimization. We prove that the approximation ratio of this procedure can be bounded in terms of the geometry of constraint set. For a broad class of random projections, including those based on various sub-Gaussian distributions as well as randomized Hadamard and Fourier transforms, the data matrix defining the cost function can be projected down to the statistical dimension of the tangent cone of the constraints at the original solution, which is often substantially smaller than the original dimension. We illustrate consequences of our theory for various cases, including unconstrained and $\ell_1$-constrained least squares, support vector machines, low-rank matrix estimation, and discuss implications on privacy-sensitive optimization and some connections with de-noising and compressed sensing. △ Less

Submitted 28 April, 2014; originally announced April 2014.

Showing 51–91 of 91 results for author: Pilanci, M