-
A Clipped Trip: the Dynamics of SGD with Gradient Clipping in High-Dimensions
Authors:
Noah Marshall,
Ke Liang Xiao,
Atish Agarwala,
Elliot Paquette
Abstract:
The success of modern machine learning is due in part to the adaptive optimization methods that have been developed to deal with the difficulties of training large models over complex datasets. One such method is gradient clipping: a practical procedure with limited theoretical underpinnings. In this work, we study clipping in a least squares problem under streaming SGD. We develop a theoretical analysis of the learning dynamics in the limit of large intrinsic dimension, a model- and dataset-dependent notion of dimensionality. In this limit we find a deterministic equation that describes the evolution of the loss. We show that with Gaussian noise clipping cannot improve SGD performance. Yet, in other noisy settings, clipping can provide benefits with tuning of the clipping threshold. In these cases, clipping biases updates in a way beneficial to training which cannot be recovered by SGD under any schedule. We conclude with a discussion about the links between high-dimensional clipping and neural network training.
Submitted 17 June, 2024;
originally announced June 2024.
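For readers unfamiliar with the setting in the abstract above, here is a minimal sketch of streaming SGD on a least squares problem with norm-based gradient clipping. It is an illustration only, not the authors' code: the Gaussian data model, the function names `clip` and `streaming_sgd`, and all default parameters (dimension, step size, noise level) are assumptions chosen for the example.

```python
import numpy as np

def clip(g, c):
    """Rescale g so that its Euclidean norm is at most c."""
    norm = np.linalg.norm(g)
    return g if norm <= c else g * (c / norm)

def streaming_sgd(d=50, steps=2000, lr=0.01, c=None, noise_std=0.1, seed=0):
    """One-pass SGD on 0.5 * (a @ x - b)**2 with fresh samples each step.

    If c is not None, each stochastic gradient is clipped to norm c.
    Returns the squared distance to the (hypothetical) ground truth x_star.
    """
    rng = np.random.default_rng(seed)
    x_star = rng.standard_normal(d) / np.sqrt(d)  # assumed ground truth
    x = np.zeros(d)
    for _ in range(steps):
        a = rng.standard_normal(d)                 # fresh sample (streaming)
        b = a @ x_star + noise_std * rng.standard_normal()
        g = (a @ x - b) * a                        # gradient of the squared loss
        if c is not None:
            g = clip(g, c)
        x -= lr * g
    return float(np.sum((x - x_star) ** 2))

if __name__ == "__main__":
    print("unclipped error:", streaming_sgd())
    print("clipped error:  ", streaming_sgd(c=1.0))
```

Swapping the Gaussian label noise for a heavier-tailed distribution is one way to experiment with the regimes the abstract contrasts; the sketch makes no claim about which setting benefits.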
-
Learning equivariant tensor functions with applications to sparse vector recovery
Authors:
Wilson G. Gregory,
Josué Tonelli-Cueto,
Nicholas F. Marshall,
Andrew S. Lee,
Soledad Villar
Abstract:
This work characterizes equivariant polynomial functions from tuples of tensor inputs to tensor outputs. Loosely motivated by physics, we focus on equivariant functions with respect to the diagonal action of the orthogonal group on tensors. We show how to extend this characterization to other linear algebraic groups, including the Lorentz and symplectic groups.
Our goal behind these characterizations is to define equivariant machine learning models. In particular, we focus on the sparse vector estimation problem. This problem has been broadly studied in the theoretical computer science literature, and explicit spectral methods, derived by techniques from sum-of-squares, can be shown to recover sparse vectors under certain assumptions. Our numerical results show that the proposed equivariant machine learning models can learn spectral methods that outperform the best theoretically known spectral methods in some regimes. The experiments also suggest that learned spectral methods can solve the problem in settings that have not yet been theoretically analyzed.
This is an example of a promising direction in which theory can inform machine learning models and machine learning models could inform theory.
Submitted 3 June, 2024;
originally announced June 2024.
-
Laplace-HDC: Understanding the geometry of binary hyperdimensional computing
Authors:
Saeid Pourmand,
Wyatt D. Whiting,
Alireza Aghasi,
Nicholas F. Marshall
Abstract:
This paper studies the geometry of binary hyperdimensional computing (HDC), a computational scheme in which data are encoded using high-dimensional binary vectors. We establish a result about the similarity structure induced by the HDC binding operator and show that the Laplace kernel naturally arises in this setting, motivating our new encoding method Laplace-HDC, which improves upon previous methods. We describe how our results indicate limitations of binary HDC in encoding spatial information from images and discuss potential solutions, including using Haar convolutional features and the definition of a translation-equivariant HDC encoding. Several numerical experiments highlighting the improved accuracy of Laplace-HDC in contrast to alternative methods are presented. We also numerically study other aspects of the proposed framework such as robustness and the underlying translation-equivariant encoding.
Submitted 26 April, 2024; v1 submitted 16 April, 2024;
originally announced April 2024.
-
Fast expansion into harmonics on the disk: a steerable basis with fast radial convolutions
Authors:
Nicholas F. Marshall,
Oscar Mickelin,
Amit Singer
Abstract:
We present a fast and numerically accurate method for expanding digitized $L \times L$ images representing functions on $[-1,1]^2$ supported on the disk $\{x \in \mathbb{R}^2 : |x|<1\}$ in the harmonics (Dirichlet Laplacian eigenfunctions) on the disk. Our method, which we refer to as the Fast Disk Harmonics Transform (FDHT), runs in $O(L^2 \log L)$ operations. This basis is also known as the Fourier-Bessel basis, and it has several computational advantages: it is orthogonal, ordered by frequency, and steerable in the sense that images expanded in the basis can be rotated by applying a diagonal transform to the coefficients. Moreover, we show that convolution with radial functions can also be efficiently computed by applying a diagonal transform to the coefficients.
Submitted 21 December, 2022; v1 submitted 27 July, 2022;
originally announced July 2022.
-
An optimal scheduled learning rate for a randomized Kaczmarz algorithm
Authors:
Nicholas F. Marshall,
Oscar Mickelin
Abstract:
We study how the learning rate affects the performance of a relaxed randomized Kaczmarz algorithm for solving $A x \approx b + \varepsilon$, where $A x =b$ is a consistent linear system and $\varepsilon$ has independent mean zero random entries. We derive a learning rate schedule which optimizes a bound on the expected error that is sharp in certain cases; in contrast to the exponential convergence of the standard randomized Kaczmarz algorithm, our optimized bound involves the reciprocal of the Lambert-$W$ function of an exponential.
Submitted 9 August, 2022; v1 submitted 24 February, 2022;
originally announced February 2022.
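As a rough illustration of the algorithm named in the abstract above, the following sketch implements a relaxed randomized Kaczmarz iteration with a user-supplied learning rate schedule. It assumes rows are sampled with probability proportional to their squared norms, as in the standard randomized Kaczmarz method; the function name `relaxed_kaczmarz` and the `schedule` callback are hypothetical, and the paper's optimized schedule is not reproduced here.

```python
import numpy as np

def relaxed_kaczmarz(A, b, schedule, steps, seed=0):
    """Relaxed randomized Kaczmarz for A x ~ b.

    schedule(k) returns the learning rate (relaxation parameter) at step k;
    schedule = lambda k: 1.0 gives the standard, unrelaxed projection step.
    """
    rng = np.random.default_rng(seed)
    row_norms_sq = np.sum(A ** 2, axis=1)
    probs = row_norms_sq / row_norms_sq.sum()  # assumed sampling distribution
    x = np.zeros(A.shape[1])
    for k in range(steps):
        i = rng.choice(A.shape[0], p=probs)
        residual = b[i] - A[i] @ x
        # Move a fraction schedule(k) of the way to the hyperplane of row i.
        x = x + schedule(k) * (residual / row_norms_sq[i]) * A[i]
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((200, 10))
    x_true = rng.standard_normal(10)
    x_hat = relaxed_kaczmarz(A, A @ x_true, lambda k: 1.0, steps=2000)
    print("error on a consistent system:", np.linalg.norm(x_hat - x_true))
```

With noisy right-hand sides, a decaying `schedule` trades per-step progress for less sensitivity to the noise, which is the trade-off the abstract's learning rate analysis addresses.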
-
A common variable minimax theorem for graphs
Authors:
Ronald R. Coifman,
Nicholas F. Marshall,
Stefan Steinerberger
Abstract:
Let $\mathcal{G} = \{G_1 = (V, E_1), \dots, G_m = (V, E_m)\}$ be a collection of $m$ graphs defined on a common set of vertices $V$ but with different edge sets $E_1, \dots, E_m$. Informally, a function $f :V \rightarrow \mathbb{R}$ is smooth with respect to $G_k = (V,E_k)$ if $f(u) \sim f(v)$ whenever $(u, v) \in E_k$. We study the problem of understanding whether there exists a nonconstant function that is smooth with respect to all graphs in $\mathcal{G}$, simultaneously, and how to find it if it exists.
Submitted 30 July, 2021;
originally announced July 2021.
-
FairCal: Fairness Calibration for Face Verification
Authors:
Tiago Salvador,
Stephanie Cairns,
Vikram Voleti,
Noah Marshall,
Adam Oberman
Abstract:
Despite being widely used, face recognition models suffer from bias: the probability of a false positive (incorrect face match) strongly depends on sensitive attributes such as the ethnicity of the face. As a result, these models can disproportionately and negatively impact minority groups, particularly when used by law enforcement. The majority of bias reduction methods have several drawbacks: they use an end-to-end retraining approach, may not be feasible due to privacy issues, and often reduce accuracy. An alternative approach is post-processing methods that build fairer decision classifiers using the features of pre-trained models, thus avoiding the cost of retraining. However, they still have drawbacks: they reduce accuracy (AGENDA, PASS, FTC), or require retuning for different false positive rates (FSN). In this work, we introduce the Fairness Calibration (FairCal) method, a post-training approach that simultaneously: (i) increases model accuracy (improving the state-of-the-art), (ii) produces fairly-calibrated probabilities, (iii) significantly reduces the gap in the false positive rates, (iv) does not require knowledge of the sensitive attribute, and (v) does not require retraining, training an additional model, or retuning. We apply it to the task of Face Verification, and obtain state-of-the-art results with all the above advantages.
Submitted 30 March, 2022; v1 submitted 7 June, 2021;
originally announced June 2021.
-
Multi-target detection with rotations
Authors:
Tamir Bendory,
Ti-Yen Lan,
Nicholas F. Marshall,
Iris Rukshin,
Amit Singer
Abstract:
We consider the multi-target detection problem of estimating a two-dimensional target image from a large noisy measurement image that contains many randomly rotated and translated copies of the target image. Motivated by single-particle cryo-electron microscopy, we focus on the low signal-to-noise regime, where it is difficult to estimate the locations and orientations of the target images in the measurement. Our approach uses autocorrelation analysis to estimate rotationally and translationally invariant features of the target image. We demonstrate that, regardless of the level of noise, our technique can be used to recover the target image when the measurement is sufficiently large.
Submitted 2 September, 2022; v1 submitted 19 January, 2021;
originally announced January 2021.
-
Image recovery from rotational and translational invariants
Authors:
Nicholas F. Marshall,
Ti-Yen Lan,
Tamir Bendory,
Amit Singer
Abstract:
We introduce a framework for recovering an image from its rotationally and translationally invariant features based on autocorrelation analysis. This work is an instance of the multi-target detection statistical model, which is mainly used to study the mathematical and computational properties of single-particle reconstruction using cryo-electron microscopy (cryo-EM) at low signal-to-noise ratios. We demonstrate with synthetic numerical experiments that an image can be reconstructed from rotationally and translationally invariant features and show that the reconstruction is robust to noise. These results constitute an important step towards the goal of structure determination of small biomolecules using cryo-EM.
Submitted 22 October, 2019;
originally announced October 2019.
-
Time Coupled Diffusion Maps
Authors:
Nicholas F. Marshall,
Matthew J. Hirn
Abstract:
We consider a collection of $n$ points in $\mathbb{R}^d$ measured at $m$ times, which are encoded in an $n \times d \times m$ data tensor. Our objective is to define a single embedding of the $n$ points into Euclidean space which summarizes the geometry as described by the data tensor. In the case of a fixed data set, diffusion maps (and related graph Laplacian methods) define such an embedding via the eigenfunctions of a diffusion operator constructed on the data. Given a sequence of $m$ measurements of $n$ points, we construct a corresponding sequence of diffusion operators and study their product. Via this product, we introduce the notion of time coupled diffusion distance and time coupled diffusion maps which have natural geometric and probabilistic interpretations. To frame our method in the context of manifold learning, we model evolving data as samples from an underlying manifold with a time dependent metric, and we describe a connection of our method to the heat equation over a manifold with time dependent metric.
Submitted 13 November, 2017; v1 submitted 11 August, 2016;
originally announced August 2016.
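The construction described in the abstract above (a sequence of diffusion operators built from an $n \times d \times m$ data tensor, then multiplied together) can be sketched as follows. This is only a schematic: the Gaussian kernel, the bandwidth `eps`, the row normalization, and the order of the product are assumptions made for illustration and may differ from the paper's definitions.

```python
import numpy as np

def markov_matrix(X, eps):
    """Row-stochastic diffusion operator from a Gaussian kernel on the rows of X."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / eps)
    return K / K.sum(axis=1, keepdims=True)

def product_of_diffusion_operators(data, eps=1.0):
    """Multiply the per-time diffusion operators of an n x d x m data tensor.

    Later time slices are applied on the left (an assumed convention).
    """
    n, _, m = data.shape
    P = np.eye(n)
    for t in range(m):
        P = markov_matrix(data[:, :, t], eps) @ P
    return P

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.standard_normal((20, 3, 4))  # n=20 points, d=3, m=4 times
    P = product_of_diffusion_operators(data)
    print(P.shape, P.sum(axis=1)[:3])
```

Because each factor is row-stochastic, so is the product, which is what allows a probabilistic reading of the combined operator; an embedding would then be computed from its spectral or singular value decomposition.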