Fine-grained Analysis and Faster Algorithms for Iteratively Solving Linear Systems

Authors: Michał Dereziński, Daniel LeJeune, Deanna Needell, Elizaveta Rebrova

Abstract: While effective in practice, iterative methods for solving large systems of linear equations can be significantly affected by problem-dependent condition number quantities. This makes characterizing their time complexity challenging, particularly when we wish to make comparisons between deterministic and stochastic methods, that may or may not rely on preconditioning and/or fast matrix multiplication. In this work, we consider a fine-grained notion of complexity for iterative linear solvers which we call the spectral tail condition number, $κ_\ell$, defined as the ratio between the $\ell$th largest and the smallest singular value of the matrix representing the system. Concretely, we prove the following main algorithmic result: Given an $n\times n$ matrix $A$ and a vector $b$, we can find $\tilde{x}$ such that $\|A\tilde{x}-b\|\leqε\|b\|$ in time $\tilde{O}(κ_\ell\cdot n^2\log 1/ε)$ for any $\ell = O(n^{\frac1{ω-1}})=O(n^{0.729})$, where $ω\approx 2.372$ is the current fast matrix multiplication exponent. This guarantee is achieved by Sketch-and-Project with Nesterov's acceleration. Some of the implications of our result, and of the use of $κ_\ell$, include direct improvement over a fine-grained analysis of the Conjugate Gradient method, suggesting a stronger separation between deterministic and stochastic iterative solvers; and relating the complexity of iterative solvers to the ongoing algorithmic advances in fast matrix multiplication, since the bound on $\ell$ improves with $ω$.
Our main technical contributions are new sharp characterizations for the first and second moments of the random projection matrix that commonly arises in sketching algorithms, building on a combination of techniques from combinatorial sampling via determinantal point processes and Gaussian universality results from random matrix theory.

Submitted 9 May, 2024; originally announced May 2024.

Comments: 32 pages

arXiv:2308.15478

An Adaptive Tangent Feature Perspective of Neural Networks

Authors: Daniel LeJeune, Sina Alemohammad

Abstract: In order to better understand feature learning in neural networks, we propose a framework for understanding linear models in tangent feature space where the features are allowed to be transformed during training. We consider linear transformations of features, resulting in a joint optimization over parameters and transformations with a bilinear interpolation constraint. We show that this optimization problem has an equivalent linearly constrained optimization with structured regularization that encourages approximately low rank solutions. Specializing to neural network structure, we gain insights into how the features and thus the kernel function change, providing additional nuance to the phenomenon of kernel alignment when the target function is poorly represented using tangent features. We verify our theoretical observations in the kernel alignment of real neural networks.

Submitted 20 February, 2024; v1 submitted 29 August, 2023; originally announced August 2023.

Comments: 14 pages, 3 figures. Appeared at the First Conference on Parsimony and Learning (CPAL 2024)

arXiv:2307.01850

Self-Consuming Generative Models Go MAD

Authors: Sina Alemohammad, Josue Casco-Rodriguez, Lorenzo Luzi, Ahmed Imtiaz Humayun, Hossein Babaei, Daniel LeJeune, Ali Siahkoohi, Richard G. Baraniuk

Abstract: Seismic advances in generative AI algorithms for imagery, text, and other data types have led to the temptation to use synthetic data to train next-generation models. Repeating this process creates an autophagous (self-consuming) loop whose properties are poorly understood. We conduct a thorough analytical and empirical analysis using state-of-the-art generative image models of three families of autophagous loops that differ in how fixed or fresh real training data is available through the generations of training and in whether the samples from previous generation models have been biased to trade off data quality versus diversity. Our primary conclusion across all scenarios is that without enough fresh real data in each generation of an autophagous loop, future generative models are doomed to have their quality (precision) or diversity (recall) progressively decrease. We term this condition Model Autophagy Disorder (MAD), making analogy to mad cow disease.

Submitted 4 July, 2023; originally announced July 2023.

Comments: 31 pages, 31 figures, pre-print

arXiv:2301.05187

WIRE: Wavelet Implicit Neural Representations

Authors: Vishwanath Saragadam, Daniel LeJeune, Jasper Tan, Guha Balakrishnan, Ashok Veeraraghavan, Richard G. Baraniuk

Abstract: Implicit neural representations (INRs) have recently advanced numerous vision-related areas. INR performance depends strongly on the choice of the nonlinear activation function employed in its multilayer perceptron (MLP) network. A wide range of nonlinearities have been explored, but, unfortunately, current INRs designed to have high accuracy also suffer from poor robustness (to signal noise, parameter variation, etc.). Inspired by harmonic analysis, we develop a new, highly accurate and robust INR that does not exhibit this tradeoff. Wavelet Implicit neural REpresentation (WIRE) uses a continuous complex Gabor wavelet activation function that is well-known to be optimally concentrated in space-frequency and to have excellent biases for representing images. A wide range of experiments (image denoising, image inpainting, super-resolution, computed tomography reconstruction, image overfitting, and novel view synthesis with neural radiance fields) demonstrate that WIRE defines the new state of the art in INR accuracy, training time, and robustness.

Submitted 5 January, 2023; originally announced January 2023.

arXiv:2211.03751

Asymptotics of the Sketched Pseudoinverse

Authors: Daniel LeJeune, Pratik Patil, Hamid Javadi, Richard G. Baraniuk, Ryan J. Tibshirani

Abstract: We take a random matrix theory approach to random sketching and show an asymptotic first-order equivalence of the regularized sketched pseudoinverse of a positive semidefinite matrix to a certain evaluation of the resolvent of the same matrix. We focus on real-valued regularization and extend previous results on an asymptotic equivalence of random matrices to the real setting, providing a precise characterization of the equivalence even under negative regularization, including a precise characterization of the smallest nonzero eigenvalue of the sketched matrix, which may be of independent interest. We then further characterize the second-order equivalence of the sketched pseudoinverse. We also apply our results to the analysis of the sketch-and-project method and to sketched ridge regression. Lastly, we prove that these results generalize to asymptotically free sketching matrices, obtaining the resulting equivalence for orthogonal sketching matrices and comparing our results to several common sketches used in practice.

Submitted 6 October, 2023; v1 submitted 7 November, 2022; originally announced November 2022.

Comments: 45 pages, 9 figures

MSC Class: 15B52; 46L54; 62J07

arXiv:2210.11589

Monotonic Risk Relationships under Distribution Shifts for Regularized Risk Minimization

Authors: Daniel LeJeune, Jiayu Liu, Reinhard Heckel

Abstract: Machine learning systems are often applied to data that is drawn from a different distribution than the training distribution. Recent work has shown that for a variety of classification and signal reconstruction problems, the out-of-distribution performance is strongly linearly correlated with the in-distribution performance. If this relationship or more generally a monotonic one holds, it has important consequences. For example, it allows one to optimize performance on one distribution as a proxy for performance on the other. In this paper, we study conditions under which a monotonic relationship between the performances of a model on two distributions is expected. We prove an exact asymptotic linear relation for squared error and a monotonic relation for misclassification error for ridge-regularized general linear models under covariate shift, as well as an approximate linear relation for linear inverse problems.

Submitted 20 July, 2023; v1 submitted 20 October, 2022; originally announced October 2022.

Comments: 34 pages, 7 figures

arXiv:2205.14055

A Blessing of Dimensionality in Membership Inference through Regularization

Authors: Jasper Tan, Daniel LeJeune, Blake Mason, Hamid Javadi, Richard G. Baraniuk

Abstract: Is overparameterization a privacy liability? In this work, we study the effect that the number of parameters has on a classifier's vulnerability to membership inference attacks. We first demonstrate how the number of parameters of a model can induce a privacy--utility trade-off: increasing the number of parameters generally improves generalization performance at the expense of lower privacy. However, remarkably, we then show that if coupled with proper regularization, increasing the number of parameters of a model can actually simultaneously increase both its privacy and performance, thereby eliminating the privacy--utility trade-off. Theoretically, we demonstrate this curious phenomenon for logistic regression with ridge regularization in a bi-level feature ensemble setting. Pursuant to our theoretical exploration, we develop a novel leave-one-out analysis tool to precisely characterize the vulnerability of a linear classifier to the optimal membership inference attack. We empirically exhibit this "blessing of dimensionality" for neural networks on a variety of tasks using early stopping as the regularizer.

Submitted 13 April, 2023; v1 submitted 27 May, 2022; originally announced May 2022.

Comments: 26 pages, 14 figures

arXiv:2106.07769

The Flip Side of the Reweighted Coin: Duality of Adaptive Dropout and Regularization

Authors: Daniel LeJeune, Hamid Javadi, Richard G. Baraniuk

Abstract: Among the most successful methods for sparsifying deep (neural) networks are those that adaptively mask the network weights throughout training. By examining this masking, or dropout, in the linear case, we uncover a duality between such adaptive methods and regularization through the so-called "$η$-trick" that casts both as iteratively reweighted optimizations. We show that any dropout strategy that adapts to the weights in a monotonic way corresponds to an effective subquadratic regularization penalty, and therefore leads to sparse solutions. We obtain the effective penalties for several popular sparsification strategies, which are remarkably similar to classical penalties commonly used in sparse optimization. Considering variational dropout as a case study, we demonstrate similar empirical behavior between the adaptive dropout method and classical methods on the task of deep network sparsification, validating our theory.

Submitted 3 January, 2022; v1 submitted 14 June, 2021; originally announced June 2021.

Comments: 19 pages, 2 figures. Appeared in NeurIPS 2021. Small typographical correction

arXiv:2103.05621

The Common Intuition to Transfer Learning Can Win or Lose: Case Studies for Linear Regression

Authors: Yehuda Dar, Daniel LeJeune, Richard G. Baraniuk

Abstract: We study a fundamental transfer learning process from source to target linear regression tasks, including overparameterized settings where there are more learned parameters than data samples. The target task learning is addressed by using its training data together with the parameters previously computed for the source task. We define a transfer learning approach to the target task as a linear regression optimization with a regularization on the distance between the to-be-learned target parameters and the already-learned source parameters. We analytically characterize the generalization performance of our transfer learning approach and demonstrate its ability to resolve the peak in generalization errors in double descent phenomena of the minimum L2-norm solution to linear regression. Moreover, we show that for sufficiently related tasks, the optimally tuned transfer learning approach can outperform the optimally tuned ridge regression method, even when the true parameter vector conforms to an isotropic Gaussian prior distribution. Namely, we demonstrate that transfer learning can beat the minimum mean square error (MMSE) solution of the independent target task. Our results emphasize the ability of transfer learning to extend the solution space to the target task and, by that, to have an improved MMSE solution. We formulate the linear MMSE solution to our transfer learning setting and point out its key differences from the common design philosophy to transfer learning.

Submitted 31 May, 2024; v1 submitted 9 March, 2021; originally announced March 2021.

arXiv:2010.13975

Wearing a MASK: Compressed Representations of Variable-Length Sequences Using Recurrent Neural Tangent Kernels

Authors: Sina Alemohammad, Hossein Babaei, Randall Balestriero, Matt Y. Cheung, Ahmed Imtiaz Humayun, Daniel LeJeune, Naiming Liu, Lorenzo Luzi, Jasper Tan, Zichao Wang, Richard G. Baraniuk

Abstract: High dimensionality poses many challenges to the use of data, from visualization and interpretation, to prediction and storage for historical preservation. Techniques abound to reduce the dimensionality of fixed-length sequences, yet these methods rarely generalize to variable-length sequences. To address this gap, we extend existing methods that rely on the use of kernels to variable-length sequences via use of the Recurrent Neural Tangent Kernel (RNTK). Since a deep neural network with ReLU activation is a Max-Affine Spline Operator (MASO), we dub our approach Max-Affine Spline Kernel (MASK). We demonstrate how MASK can be used to extend principal components analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) and apply these new algorithms to separate synthetic time series data sampled from second-order differential equations.

Submitted 17 April, 2021; v1 submitted 26 October, 2020; originally announced October 2020.

arXiv:1910.04743

The Implicit Regularization of Ordinary Least Squares Ensembles

Authors: Daniel LeJeune, Hamid Javadi, Richard G. Baraniuk

Abstract: Ensemble methods that average over a collection of independent predictors that are each limited to a subsampling of both the examples and features of the training data command a significant presence in machine learning, such as the ever-popular random forest, yet the nature of the subsampling effect, particularly of the features, is not well understood. We study the case of an ensemble of linear predictors, where each individual predictor is fit using ordinary least squares on a random submatrix of the data matrix. We show that, under standard Gaussianity assumptions, when the number of features selected for each predictor is optimally tuned, the asymptotic risk of a large ensemble is equal to the asymptotic ridge regression risk, which is known to be optimal among linear predictors in this setting. In addition to eliciting this implicit regularization that results from subsampling, we also connect this ensemble to the dropout technique used in training deep (neural) networks, another strategy that has been shown to have a ridge-like regularizing effect.

Submitted 24 March, 2020; v1 submitted 10 October, 2019; originally announced October 2019.

Comments: 18 pages, 4 figures. To appear in AISTATS 2020

arXiv:1905.11639

Implicit Rugosity Regularization via Data Augmentation

Authors: Daniel LeJeune, Randall Balestriero, Hamid Javadi, Richard G. Baraniuk

Abstract: Deep (neural) networks have been applied productively in a wide range of supervised and unsupervised learning tasks. Unlike classical machine learning algorithms, deep networks typically operate in the overparameterized regime, where the number of parameters is larger than the number of training data points. Consequently, understanding the generalization properties and the role of (explicit or implicit) regularization in these networks is of great importance. In this work, we explore how the oft-used heuristic of data augmentation imposes an implicit regularization penalty of a novel measure of the rugosity or "roughness" based on the tangent Hessian of the function fit to the training data.

Submitted 10 October, 2019; v1 submitted 28 May, 2019; originally announced May 2019.

Comments: 15 pages, 12 figures

arXiv:1905.09190

Thresholding Graph Bandits with GrAPL

Authors: Daniel LeJeune, Gautam Dasarathy, Richard G. Baraniuk

Abstract: In this paper, we introduce a new online decision making paradigm that we call Thresholding Graph Bandits. The main goal is to efficiently identify a subset of arms in a multi-armed bandit problem whose means are above a specified threshold. While traditionally in such problems, the arms are assumed to be independent, in our paradigm we further suppose that we have access to the similarity between the arms in the form of a graph, allowing us to gain information about the arm means in fewer samples. Such settings play a key role in a wide range of modern decision making problems where rapid decisions need to be made in spite of the large number of options available at each time. We present GrAPL, a novel algorithm for the thresholding graph bandit problem. We demonstrate theoretically that this algorithm is effective in taking advantage of the graph structure when available and the reward function homophily (that strongly connected arms have similar rewards) when favorable. We confirm these theoretical findings via experiments on both synthetic and real data.

Submitted 24 March, 2020; v1 submitted 22 May, 2019; originally announced May 2019.

Comments: 14 pages, 3 figures. To appear in AISTATS 2020

arXiv:1902.09465

Adaptive Estimation for Approximate k-Nearest-Neighbor Computations

Authors: Daniel LeJeune, Richard G. Baraniuk, Reinhard Heckel

Abstract: Algorithms often carry out equally many computations for "easy" and "hard" problem instances. In particular, algorithms for finding nearest neighbors typically have the same running time regardless of the particular problem instance. In this paper, we consider the approximate k-nearest-neighbor problem, which is the problem of finding a subset of O(k) points in a given set of points that contains the set of k nearest neighbors of a given query point. We propose an algorithm based on adaptively estimating the distances, and show that it is essentially optimal out of algorithms that are only allowed to adaptively estimate distances. We then demonstrate both theoretically and experimentally that the algorithm can achieve significant speedups relative to the naive method.

Submitted 25 February, 2019; originally announced February 2019.

Comments: 11 pages, 2 figures. To appear in AISTATS 2019

Journal ref: Proceedings of Machine Learning Research 89 (2019):3099-3107

arXiv:1806.04310

MISSION: Ultra Large-Scale Feature Selection using Count-Sketches

Authors: Amirali Aghazadeh, Ryan Spring, Daniel LeJeune, Gautam Dasarathy, Anshumali Shrivastava, Richard G. Baraniuk

Abstract: Feature selection is an important challenge in machine learning. It plays a crucial role in the explainability of machine-driven decisions that are rapidly permeating throughout modern society. Unfortunately, the explosion in the size and dimensionality of real-world datasets poses a severe challenge to standard feature selection algorithms. Today, it is not uncommon for datasets to have billions of dimensions. At such scale, even storing the feature vector is impossible, causing most existing feature selection methods to fail. Workarounds like feature hashing, a standard approach to large-scale machine learning, help with the computational feasibility, but at the cost of losing the interpretability of features. In this paper, we present MISSION, a novel framework for ultra large-scale feature selection that performs stochastic gradient descent while maintaining an efficient representation of the features in memory using a Count-Sketch data structure. MISSION retains the simplicity of feature hashing without sacrificing the interpretability of the features while using only O(log^2(p)) working memory. We demonstrate that MISSION accurately and efficiently performs feature selection on real-world, large-scale datasets with billions of dimensions.

Submitted 11 June, 2018; originally announced June 2018.

arXiv:1303.0866

Adaptive Partitioning and its Applicability to a Highly Scalable and Available Geo-Spatial Indexing Solution

Authors: David W. LeJeune Jr

Abstract: Satellite Tracking of People (STOP) tracks thousands of GPS-enabled devices 24 hours a day and 365 days a year. With locations captured for each device every minute, STOP servers receive tens of millions of points each day. In addition to cataloging these points in real-time, STOP must also respond to questions from customers such as, "What devices of mine were at this location two months ago?" They often then broaden their question to one such as, "Which of my devices have ever been at this location?" The processing requirements necessary to answer these questions while continuing to process inbound data in real-time are non-trivial. To meet this demand, STOP developed Adaptive Partitioning to provide a cost-effective and highly available hardware platform for the geographical and time-spatial indexing capabilities necessary for responding to customer data requests while continuing to catalog inbound data in real-time.

Submitted 4 March, 2013; originally announced March 2013.

Showing 1–16 of 16 results for author: LeJeune, D