-
Filtered not Mixed: Stochastic Filtering-Based Online Gating for Mixture of Large Language Models
Authors:
Raeid Saqur,
Anastasis Kratsios,
Florian Krach,
Yannick Limmer,
Jacob-Junqi Tian,
John Willes,
Blanka Horvath,
Frank Rudzicz
Abstract:
We propose MoE-F -- a formalised mechanism for combining $N$ pre-trained expert Large Language Models (LLMs) in online time-series prediction tasks by adaptively forecasting the best weighting of LLM predictions at every time step. Our mechanism leverages the conditional information in each expert's running performance to forecast the best combination of LLMs for predicting the time series in its…
▽ More
We propose MoE-F -- a formalised mechanism for combining $N$ pre-trained expert Large Language Models (LLMs) in online time-series prediction tasks by adaptively forecasting the best weighting of LLM predictions at every time step. Our mechanism leverages the conditional information in each expert's running performance to forecast the best combination of LLMs for predicting the time series in its next step. Diverging from static (learned) Mixture of Experts (MoE) methods, MoE-F employs time-adaptive stochastic filtering techniques to combine experts. By framing the expert selection problem as a finite state-space, continuous-time Hidden Markov model (HMM), we can leverage the Wohman-Shiryaev filter. Our approach first constructs $N$ parallel filters corresponding to each of the $N$ individual LLMs. Each filter proposes its best combination of LLMs, given the information that they have access to. Subsequently, the $N$ filter outputs are aggregated to optimize a lower bound for the loss of the aggregated LLMs, which can be optimized in closed-form, thus generating our ensemble predictor. Our contributions here are: (I) the MoE-F algorithm -- deployable as a plug-and-play filtering harness, (II) theoretical optimality guarantees of the proposed filtering-based gating algorithm, and (III) empirical evaluation and ablative results using state of the art foundational and MoE LLMs on a real-world Financial Market Movement task where MoE-F attains a remarkable 17% absolute and 48.5% relative F1 measure improvement over the next best performing individual LLM expert.
△ Less
Submitted 5 June, 2024;
originally announced June 2024.
-
Low-dimensional approximations of the conditional law of Volterra processes: a non-positive curvature approach
Authors:
Reza Arabpour,
John Armstrong,
Luca Galimberti,
Anastasis Kratsios,
Giulia Livieri
Abstract:
Predicting the conditional evolution of Volterra processes with stochastic volatility is a crucial challenge in mathematical finance. While deep neural network models offer promise in approximating the conditional law of such processes, their effectiveness is hindered by the curse of dimensionality caused by the infinite dimensionality and non-smooth nature of these problems. To address this, we p…
▽ More
Predicting the conditional evolution of Volterra processes with stochastic volatility is a crucial challenge in mathematical finance. While deep neural network models offer promise in approximating the conditional law of such processes, their effectiveness is hindered by the curse of dimensionality caused by the infinite dimensionality and non-smooth nature of these problems. To address this, we propose a two-step solution. Firstly, we develop a stable dimension reduction technique, projecting the law of a reasonably broad class of Volterra process onto a low-dimensional statistical manifold of non-positive sectional curvature. Next, we introduce a sequentially deep learning model tailored to the manifold's geometry, which we show can approximate the projected conditional law of the Volterra process. Our model leverages an auxiliary hypernetwork to dynamically update its internal parameters, allowing it to encode non-stationary dynamics of the Volterra process, and it can be interpreted as a gating mechanism in a mixture of expert models where each expert is specialized at a specific point in time. Our hypernetwork further allows us to achieve approximation rates that would seemingly only be possible with very large networks.
△ Less
Submitted 30 May, 2024;
originally announced May 2024.
-
Reality Only Happens Once: Single-Path Generalization Bounds for Transformers
Authors:
Yannick Limmer,
Anastasis Kratsios,
Xuwei Yang,
Raeid Saqur,
Blanka Horvath
Abstract:
One of the inherent challenges in deploying transformers on time series is that \emph{reality only happens once}; namely, one typically only has access to a single trajectory of the data-generating process comprised of non-i.i.d. observations. We derive non-asymptotic statistical guarantees in this setting through bounds on the \textit{generalization} of a transformer network at a future-time $t$,…
▽ More
One of the inherent challenges in deploying transformers on time series is that \emph{reality only happens once}; namely, one typically only has access to a single trajectory of the data-generating process comprised of non-i.i.d. observations. We derive non-asymptotic statistical guarantees in this setting through bounds on the \textit{generalization} of a transformer network at a future-time $t$, given that it has been trained using $N\le t$ observations from a single perturbed trajectory of a Markov process. Under the assumption that the Markov process satisfies a log-Sobolev inequality, we obtain a generalization bound which effectively converges at the rate of ${O}(1/\sqrt{N})$. Our bound depends explicitly on the activation function ($\operatorname{Swish}$, $\operatorname{GeLU}$, or $\tanh$ are considered), the number of self-attention heads, depth, width, and norm-bounds defining the transformer architecture. Our bound consists of three components: (I) The first quantifies the gap between the stationary distribution of the data-generating Markov process and its distribution at time $t$, this term converges exponentially to $0$. (II) The next term encodes the complexity of the transformer model and, given enough time, eventually converges to $0$ at the rate ${O}(\log(N)^r/\sqrt{N})$ for any $r>0$. (III) The third term guarantees that the bound holds with probability at least $1$-$δ$, and converges at a rate of ${O}(\sqrt{\log(1/δ)}/\sqrt{N})$.
△ Less
Submitted 26 May, 2024;
originally announced May 2024.
-
Mixture of Experts Soften the Curse of Dimensionality in Operator Learning
Authors:
Anastasis Kratsios,
Takashi Furuya,
Jose Antonio Lara Benitez,
Matti Lassas,
Maarten de Hoop
Abstract:
In this paper, we construct a mixture of neural operators (MoNOs) between function spaces whose complexity is distributed over a network of expert neural operators (NOs), with each NO satisfying parameter scaling restrictions. Our main result is a \textit{distributed} universal approximation theorem guaranteeing that any Lipschitz non-linear operator between $L^2([0,1]^d)$ spaces can be approximat…
▽ More
In this paper, we construct a mixture of neural operators (MoNOs) between function spaces whose complexity is distributed over a network of expert neural operators (NOs), with each NO satisfying parameter scaling restrictions. Our main result is a \textit{distributed} universal approximation theorem guaranteeing that any Lipschitz non-linear operator between $L^2([0,1]^d)$ spaces can be approximated uniformly over the Sobolev unit ball therein, to any given $\varepsilon>0$ accuracy, by an MoNO while satisfying the constraint that: each expert NO has a depth, width, and rank of $\mathcal{O}(\varepsilon^{-1})$. Naturally, our result implies that the required number of experts must be large, however, each NO is guaranteed to be small enough to be loadable into the active memory of most computers for reasonable accuracies $\varepsilon$. During our analysis, we also obtain new quantitative expression rates for classical NOs approximating uniformly continuous non-linear operators uniformly on compact subsets of $L^2([0,1]^d)$.
△ Less
Submitted 13 April, 2024;
originally announced April 2024.
-
Tighter Generalization Bounds on Digital Computers via Discrete Optimal Transport
Authors:
Anastasis Kratsios,
A. Martina Neuman,
Gudmund Pammer
Abstract:
Machine learning models with inputs in a Euclidean space $\mathbb{R}^d$, when implemented on digital computers, generalize, and their {\it generalization gap} converges to $0$ at a rate of $c/N^{1/2}$ concerning the sample size $N$. However, the constant $c>0$ obtained through classical methods can be large in terms of the ambient dimension $d$ and the machine precision, posing a challenge when…
▽ More
Machine learning models with inputs in a Euclidean space $\mathbb{R}^d$, when implemented on digital computers, generalize, and their {\it generalization gap} converges to $0$ at a rate of $c/N^{1/2}$ concerning the sample size $N$. However, the constant $c>0$ obtained through classical methods can be large in terms of the ambient dimension $d$ and the machine precision, posing a challenge when $N$ is small to realistically large. In this paper, we derive a family of generalization bounds $\{c_m/N^{1/(2\vee m)}\}_{m=1}^{\infty}$ tailored for learning models on digital computers, which adapt to both the sample size $N$ and the so-called geometric {\it representation dimension} $m$ of the discrete learning problem. Adjusting the parameter $m$ according to $N$ results in significantly tighter generalization bounds for practical sample sizes $N$, while setting $m$ small maintains the optimal dimension-free worst-case rate of $\mathcal{O}(1/N^{1/2})$. Notably, $c_{m}\in \mathcal{O}(\sqrt{m})$ for learning models on discretized Euclidean domains.
Furthermore, our adaptive generalization bounds are formulated based on our new non-asymptotic result for concentration of measure in discrete optimal transport, established via leveraging metric embedding arguments.
△ Less
Submitted 14 April, 2024; v1 submitted 8 February, 2024;
originally announced February 2024.
-
Approximation Rates and VC-Dimension Bounds for (P)ReLU MLP Mixture of Experts
Authors:
Anastasis Kratsios,
Haitz Sáez de Ocáriz Borde,
Takashi Furuya,
Marc T. Law
Abstract:
Mixture-of-Experts (MoEs) can scale up beyond traditional deep learning models by employing a routing strategy in which each input is processed by a single "expert" deep learning model. This strategy allows us to scale up the number of parameters defining the MoE while maintaining sparse activation, i.e., MoEs only load a small number of their total parameters into GPU VRAM for the forward pass de…
▽ More
Mixture-of-Experts (MoEs) can scale up beyond traditional deep learning models by employing a routing strategy in which each input is processed by a single "expert" deep learning model. This strategy allows us to scale up the number of parameters defining the MoE while maintaining sparse activation, i.e., MoEs only load a small number of their total parameters into GPU VRAM for the forward pass depending on the input. In this paper, we provide an approximation and learning-theoretic analysis of mixtures of expert MLPs with (P)ReLU activation functions. We first prove that for every error level $\varepsilon>0$ and every Lipschitz function $f:[0,1]^n\to \mathbb{R}$, one can construct a MoMLP model (a Mixture-of-Experts comprising of (P)ReLU MLPs) which uniformly approximates $f$ to $\varepsilon$ accuracy over $[0,1]^n$, while only requiring networks of $\mathcal{O}(\varepsilon^{-1})$ parameters to be loaded in memory. Additionally, we show that MoMLPs can generalize since the entire MoMLP model has a (finite) VC dimension of $\tilde{O}(L\max\{nL,JW\})$, if there are $L$ experts and each expert has a depth and width of $J$ and $W$, respectively.
△ Less
Submitted 25 May, 2024; v1 submitted 5 February, 2024;
originally announced February 2024.
-
Characterizing Overfitting in Kernel Ridgeless Regression Through the Eigenspectrum
Authors:
Tin Sum Cheng,
Aurelien Lucchi,
Anastasis Kratsios,
David Belius
Abstract:
We derive new bounds for the condition number of kernel matrices, which we then use to enhance existing non-asymptotic test error bounds for kernel ridgeless regression (KRR) in the over-parameterized regime for a fixed input dimension. For kernels with polynomial spectral decay, we recover the bound from previous work; for exponential decay, our bound is non-trivial and novel. Our contribution is…
▽ More
We derive new bounds for the condition number of kernel matrices, which we then use to enhance existing non-asymptotic test error bounds for kernel ridgeless regression (KRR) in the over-parameterized regime for a fixed input dimension. For kernels with polynomial spectral decay, we recover the bound from previous work; for exponential decay, our bound is non-trivial and novel. Our contribution is two-fold: (i) we rigorously prove the phenomena of tempered overfitting and catastrophic overfitting under the sub-Gaussian design assumption, closing an existing gap in the literature; (ii) we identify that the independence of the features plays an important role in guaranteeing tempered overfitting, raising concerns about approximating KRR generalization using the Gaussian design assumption in previous literature.
△ Less
Submitted 29 May, 2024; v1 submitted 2 February, 2024;
originally announced February 2024.
-
Deep Kalman Filters Can Filter
Authors:
Blanka Hovart,
Anastasis Kratsios,
Yannick Limmer,
Xuwei Yang
Abstract:
Deep Kalman filters (DKFs) are a class of neural network models that generate Gaussian probability measures from sequential data. Though DKFs are inspired by the Kalman filter, they lack concrete theoretical ties to the stochastic filtering problem, thus limiting their applicability to areas where traditional model-based filters have been used, e.g.\ model calibration for bond and option prices in…
▽ More
Deep Kalman filters (DKFs) are a class of neural network models that generate Gaussian probability measures from sequential data. Though DKFs are inspired by the Kalman filter, they lack concrete theoretical ties to the stochastic filtering problem, thus limiting their applicability to areas where traditional model-based filters have been used, e.g.\ model calibration for bond and option prices in mathematical finance. We address this issue in the mathematical foundations of deep learning by exhibiting a class of continuous-time DKFs which can approximately implement the conditional law of a broad class of non-Markovian and conditionally Gaussian signal processes given noisy continuous-times measurements. Our approximation results hold uniformly over sufficiently regular compact subsets of paths, where the approximation error is quantified by the worst-case 2-Wasserstein distance computed uniformly over the given compact set of paths.
△ Less
Submitted 30 October, 2023;
originally announced October 2023.
-
Neural Snowflakes: Universal Latent Graph Inference via Trainable Latent Geometries
Authors:
Haitz Sáez de Ocáriz Borde,
Anastasis Kratsios
Abstract:
The inductive bias of a graph neural network (GNN) is largely encoded in its specified graph. Latent graph inference relies on latent geometric representations to dynamically rewire or infer a GNN's graph to maximize the GNN's predictive downstream performance, but it lacks solid theoretical foundations in terms of embedding-based representation guarantees. This paper addresses this issue by intro…
▽ More
The inductive bias of a graph neural network (GNN) is largely encoded in its specified graph. Latent graph inference relies on latent geometric representations to dynamically rewire or infer a GNN's graph to maximize the GNN's predictive downstream performance, but it lacks solid theoretical foundations in terms of embedding-based representation guarantees. This paper addresses this issue by introducing a trainable deep learning architecture, coined neural snowflake, that can adaptively implement fractal-like metrics on $\mathbb{R}^d$. We prove that any given finite weights graph can be isometrically embedded by a standard MLP encoder. Furthermore, when the latent graph can be represented in the feature space of a sufficiently regular kernel, we show that the combined neural snowflake and MLP encoder do not succumb to the curse of dimensionality by using only a low-degree polynomial number of parameters in the number of nodes. This implementation enables a low-dimensional isometric embedding of the latent graph. We conduct synthetic experiments to demonstrate the superior metric learning capabilities of neural snowflakes when compared to more familiar spaces like Euclidean space. Additionally, we carry out latent graph inference experiments on graph benchmarks. Consistently, the neural snowflake model achieves predictive performance that either matches or surpasses that of the state-of-the-art latent graph inference models. Importantly, this performance improvement is achieved without requiring random search for optimal latent geometry. Instead, the neural snowflake model achieves this enhancement in a differentiable manner.
△ Less
Submitted 23 October, 2023;
originally announced October 2023.
-
Energy-Guided Continuous Entropic Barycenter Estimation for General Costs
Authors:
Alexander Kolesov,
Petr Mokrov,
Igor Udovichenko,
Milena Gazdieva,
Gudmund Pammer,
Anastasis Kratsios,
Evgeny Burnaev,
Alexander Korotin
Abstract:
Optimal transport (OT) barycenters are a mathematically grounded way of averaging probability distributions while capturing their geometric properties. In short, the barycenter task is to take the average of a collection of probability distributions w.r.t. given OT discrepancies. We propose a novel algorithm for approximating the continuous Entropic OT (EOT) barycenter for arbitrary OT cost functi…
▽ More
Optimal transport (OT) barycenters are a mathematically grounded way of averaging probability distributions while capturing their geometric properties. In short, the barycenter task is to take the average of a collection of probability distributions w.r.t. given OT discrepancies. We propose a novel algorithm for approximating the continuous Entropic OT (EOT) barycenter for arbitrary OT cost functions. Our approach is built upon the dual reformulation of the EOT problem based on weak OT, which has recently gained the attention of the ML community. Beyond its novelty, our method enjoys several advantageous properties: (i) we establish quality bounds for the recovered solution; (ii) this approach seamlessly interconnects with the Energy-Based Models (EBMs) learning procedure enabling the use of well-tuned algorithms for the problem of interest; (iii) it provides an intuitive optimization scheme avoiding min-max, reinforce and other intricate technical tricks. For validation, we consider several low-dimensional scenarios and image-space setups, including non-Euclidean cost functions. Furthermore, we investigate the practical task of learning the barycenter on an image manifold generated by a pretrained generative model, opening up new directions for real-world applications.
△ Less
Submitted 27 May, 2024; v1 submitted 2 October, 2023;
originally announced October 2023.
-
A Theoretical Analysis of the Test Error of Finite-Rank Kernel Ridge Regression
Authors:
Tin Sum Cheng,
Aurelien Lucchi,
Ivan Dokmanić,
Anastasis Kratsios,
David Belius
Abstract:
Existing statistical learning guarantees for general kernel regressors often yield loose bounds when used with finite-rank kernels. Yet, finite-rank kernels naturally appear in several machine learning problems, e.g.\ when fine-tuning a pre-trained deep neural network's last layer to adapt it to a novel task when performing transfer learning. We address this gap for finite-rank kernel ridge regres…
▽ More
Existing statistical learning guarantees for general kernel regressors often yield loose bounds when used with finite-rank kernels. Yet, finite-rank kernels naturally appear in several machine learning problems, e.g.\ when fine-tuning a pre-trained deep neural network's last layer to adapt it to a novel task when performing transfer learning. We address this gap for finite-rank kernel ridge regression (KRR) by deriving sharp non-asymptotic upper and lower bounds for the KRR test error of any finite-rank KRR. Our bounds are tighter than previously derived bounds on finite-rank KRR, and unlike comparable results, they also remain valid for any regularization parameters.
△ Less
Submitted 3 October, 2023; v1 submitted 2 October, 2023;
originally announced October 2023.
-
Regret-Optimal Federated Transfer Learning for Kernel Regression with Applications in American Option Pricing
Authors:
Xuwei Yang,
Anastasis Kratsios,
Florian Krach,
Matheus Grasselli,
Aurelien Lucchi
Abstract:
We propose an optimal iterative scheme for federated transfer learning, where a central planner has access to datasets ${\cal D}_1,\dots,{\cal D}_N$ for the same learning model $f_θ$. Our objective is to minimize the cumulative deviation of the generated parameters $\{θ_i(t)\}_{t=0}^T$ across all $T$ iterations from the specialized parameters $θ^\star_{1},\ldots,θ^\star_N$ obtained for each datase…
▽ More
We propose an optimal iterative scheme for federated transfer learning, where a central planner has access to datasets ${\cal D}_1,\dots,{\cal D}_N$ for the same learning model $f_θ$. Our objective is to minimize the cumulative deviation of the generated parameters $\{θ_i(t)\}_{t=0}^T$ across all $T$ iterations from the specialized parameters $θ^\star_{1},\ldots,θ^\star_N$ obtained for each dataset, while respecting the loss function for the model $f_{θ(T)}$ produced by the algorithm upon halting. We only allow for continual communication between each of the specialized models (nodes/agents) and the central planner (server), at each iteration (round). For the case where the model $f_θ$ is a finite-rank kernel regression, we derive explicit updates for the regret-optimal algorithm. By leveraging symmetries within the regret-optimal algorithm, we further develop a nearly regret-optimal heuristic that runs with $\mathcal{O}(Np^2)$ fewer elementary operations, where $p$ is the dimension of the parameter space. Additionally, we investigate the adversarial robustness of the regret-optimal algorithm showing that an adversary which perturbs $q$ training pairs by at-most $\varepsilon>0$, across all training sets, cannot reduce the regret-optimal algorithm's regret by more than $\mathcal{O}(\varepsilon q \bar{N}^{1/2})$, where $\bar{N}$ is the aggregate number of training pairs. To validate our theoretical findings, we conduct numerical experiments in the context of American option pricing, utilizing a randomly generated finite-rank kernel.
△ Less
Submitted 8 September, 2023;
originally announced September 2023.
-
Capacity Bounds for Hyperbolic Neural Network Representations of Latent Tree Structures
Authors:
Anastasis Kratsios,
Ruiyang Hong,
Haitz Sáez de Ocáriz Borde
Abstract:
We study the representation capacity of deep hyperbolic neural networks (HNNs) with a ReLU activation function. We establish the first proof that HNNs can $\varepsilon$-isometrically embed any finite weighted tree into a hyperbolic space of dimension $d$ at least equal to $2$ with prescribed sectional curvature $κ<0$, for any $\varepsilon> 1$ (where $\varepsilon=1$ being optimal). We establish rig…
▽ More
We study the representation capacity of deep hyperbolic neural networks (HNNs) with a ReLU activation function. We establish the first proof that HNNs can $\varepsilon$-isometrically embed any finite weighted tree into a hyperbolic space of dimension $d$ at least equal to $2$ with prescribed sectional curvature $κ<0$, for any $\varepsilon> 1$ (where $\varepsilon=1$ being optimal). We establish rigorous upper bounds for the network complexity on an HNN implementing the embedding. We find that the network complexity of HNN implementing the graph representation is independent of the representation fidelity/distortion. We contrast this result against our lower bounds on distortion which any ReLU multi-layer perceptron (MLP) must exert when embedding a tree with $L>2^d$ leaves into a $d$-dimensional Euclidean space, which we show at least $Ω(L^{1/d})$; independently of the depth, width, and (possibly discontinuous) activation function defining the MLP.
△ Less
Submitted 17 August, 2023;
originally announced August 2023.
-
An Approximation Theory for Metric Space-Valued Functions With A View Towards Deep Learning
Authors:
Anastasis Kratsios,
Chong Liu,
Matti Lassas,
Maarten V. de Hoop,
Ivan Dokmanić
Abstract:
Motivated by the develo** mathematics of deep learning, we build universal functions approximators of continuous maps between arbitrary Polish metric spaces $\mathcal{X}$ and $\mathcal{Y}$ using elementary functions between Euclidean spaces as building blocks. Earlier results assume that the target space $\mathcal{Y}$ is a topological vector space. We overcome this limitation by ``randomization'…
▽ More
Motivated by the develo** mathematics of deep learning, we build universal functions approximators of continuous maps between arbitrary Polish metric spaces $\mathcal{X}$ and $\mathcal{Y}$ using elementary functions between Euclidean spaces as building blocks. Earlier results assume that the target space $\mathcal{Y}$ is a topological vector space. We overcome this limitation by ``randomization'': our approximators output discrete probability measures over $\mathcal{Y}$. When $\mathcal{X}$ and $\mathcal{Y}$ are Polish without additional structure, we prove very general qualitative guarantees; when they have suitable combinatorial structure, we prove quantitative guarantees for Hölder-like maps, including maps between finite graphs, solution operators to rough differential equations between certain Carnot groups, and continuous non-linear operators between Banach spaces arising in inverse problems. In particular, we show that the required number of Dirac measures is determined by the combinatorial structure of $\mathcal{X}$ and $\mathcal{Y}$. For barycentric $\mathcal{Y}$, including Banach spaces, $\mathbb{R}$-trees, Hadamard manifolds, or Wasserstein spaces on Polish metric spaces, our approximators reduce to $\mathcal{Y}$-valued functions. When the Euclidean approximators are neural networks, our constructions generalize transformer networks, providing a new probabilistic viewpoint of geometric deep learning.
△ Less
Submitted 24 July, 2023; v1 submitted 24 April, 2023;
originally announced April 2023.
-
Generative Ornstein-Uhlenbeck Markets via Geometric Deep Learning
Authors:
Anastasis Kratsios,
Cody Hyndman
Abstract:
We consider the problem of simultaneously approximating the conditional distribution of market prices and their log returns with a single machine learning model. We show that an instance of the GDN model of Kratsios and Papon (2022) solves this problem without having prior assumptions on the market's "clipped" log returns, other than that they follow a generalized Ornstein-Uhlenbeck process with a…
▽ More
We consider the problem of simultaneously approximating the conditional distribution of market prices and their log returns with a single machine learning model. We show that an instance of the GDN model of Kratsios and Papon (2022) solves this problem without having prior assumptions on the market's "clipped" log returns, other than that they follow a generalized Ornstein-Uhlenbeck process with a priori unknown dynamics. We provide universal approximation guarantees for these conditional distributions and contingent claims with a Lipschitz payoff function.
△ Less
Submitted 17 February, 2023;
originally announced February 2023.
-
Out-of-distributional risk bounds for neural operators with applications to the Helmholtz equation
Authors:
J. Antonio Lara Benitez,
Takashi Furuya,
Florian Faucher,
Anastasis Kratsios,
Xavier Tricoche,
Maarten V. de Hoop
Abstract:
Despite their remarkable success in approximating a wide range of operators defined by PDEs, existing neural operators (NOs) do not necessarily perform well for all physics problems. We focus here on high-frequency waves to highlight possible shortcomings. To resolve these, we propose a subfamily of NOs enabling an enhanced empirical approximation of the nonlinear operator map** wave speed to so…
▽ More
Despite their remarkable success in approximating a wide range of operators defined by PDEs, existing neural operators (NOs) do not necessarily perform well for all physics problems. We focus here on high-frequency waves to highlight possible shortcomings. To resolve these, we propose a subfamily of NOs enabling an enhanced empirical approximation of the nonlinear operator map** wave speed to solution, or boundary values for the Helmholtz equation on a bounded domain. The latter operator is commonly referred to as the ''forward'' operator in the study of inverse problems. Our methodology draws inspiration from transformers and techniques such as stochastic depth. Our experiments reveal certain surprises in the generalization and the relevance of introducing stochastic depth. Our NOs show superior performance as compared with standard NOs, not only for testing within the training distribution but also for out-of-distribution scenarios. To delve into this observation, we offer an in-depth analysis of the Rademacher complexity associated with our modified models and prove an upper bound tied to their stochastic depth that existing NOs do not satisfy. Furthermore, we obtain a novel out-of-distribution risk bound tailored to Gaussian measures on Banach spaces, again relating stochastic depth with the bound. We conclude by proposing a hypernetwork version of the subfamily of NOs as a surrogate model for the mentioned forward operator.
△ Less
Submitted 4 July, 2023; v1 submitted 26 January, 2023;
originally announced January 2023.
-
Instance-Dependent Generalization Bounds via Optimal Transport
Authors:
Songyan Hou,
Parnian Kassraie,
Anastasis Kratsios,
Andreas Krause,
Jonas Rothfuss
Abstract:
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks. Since such bounds often hold uniformly over all parameters, they suffer from over-parametrization and fail to account for the strong inductive bias of initialization and stochastic gradient descent. As an alternative, we propose a novel optimal transport interpretation of the gen…
▽ More
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks. Since such bounds often hold uniformly over all parameters, they suffer from over-parametrization and fail to account for the strong inductive bias of initialization and stochastic gradient descent. As an alternative, we propose a novel optimal transport interpretation of the generalization problem. This allows us to derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space. Therefore, our bounds are agnostic to the parametrization of the model and work well when the number of training samples is much smaller than the number of parameters. With small modifications, our approach yields accelerated rates for data on low-dimensional manifolds and guarantees under distribution shifts. We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
△ Less
Submitted 13 November, 2023; v1 submitted 2 November, 2022;
originally announced November 2022.
-
Designing Universal Causal Deep Learning Models: The Case of Infinite-Dimensional Dynamical Systems from Stochastic Analysis
Authors:
Luca Galimberti,
Anastasis Kratsios,
Giulia Livieri
Abstract:
Causal operators (CO), such as various solution operators to stochastic differential equations, play a central role in contemporary stochastic analysis; however, there is still no canonical framework for designing Deep Learning (DL) models capable of approximating COs. This paper proposes a "geometry-aware'" solution to this open problem by introducing a DL model-design framework that takes suitab…
▽ More
Causal operators (CO), such as various solution operators to stochastic differential equations, play a central role in contemporary stochastic analysis; however, there is still no canonical framework for designing Deep Learning (DL) models capable of approximating COs. This paper proposes a "geometry-aware'" solution to this open problem by introducing a DL model-design framework that takes suitable infinite-dimensional linear metric spaces as inputs and returns a universal sequential DL model adapted to these linear geometries. We call these models Causal Neural Operators (CNOs). Our main result states that the models produced by our framework can uniformly approximate on compact sets and across arbitrarily finite-time horizons Hölder or smooth trace class operators, which causally map sequences between given linear metric spaces. Our analysis uncovers new quantitative relationships on the latent state-space dimension of CNOs which even have new implications for (classical) finite-dimensional Recurrent Neural Networks (RNNs). We find that a linear increase of the CNO's (or RNN's) latent parameter space's dimension and of its width, and a logarithmic increase of its depth imply an exponential increase in the number of time steps for which its approximation remains valid. A direct consequence of our analysis shows that RNNs can approximate causal functions using exponentially fewer parameters than ReLU networks.
△ Less
Submitted 9 May, 2023; v1 submitted 24 October, 2022;
originally announced October 2022.
-
Small Transformers Compute Universal Metric Embeddings
Authors:
Anastasis Kratsios,
Valentin Debarnot,
Ivan Dokmanić
Abstract:
We study representations of data from an arbitrary metric space $\mathcal{X}$ in the space of univariate Gaussian mixtures with a transport metric (Delon and Desolneux 2020). We derive embedding guarantees for feature maps implemented by small neural networks called \emph{probabilistic transformers}. Our guarantees are of memorization type: we prove that a probabilistic transformer of depth about…
▽ More
We study representations of data from an arbitrary metric space $\mathcal{X}$ in the space of univariate Gaussian mixtures with a transport metric (Delon and Desolneux 2020). We derive embedding guarantees for feature maps implemented by small neural networks called \emph{probabilistic transformers}. Our guarantees are of memorization type: we prove that a probabilistic transformer of depth about $n\log(n)$ and width about $n^2$ can bi-Hölder embed any $n$-point dataset from $\mathcal{X}$ with low metric distortion, thus avoiding the curse of dimensionality. We further derive probabilistic bi-Lipschitz guarantees, which trade off the amount of distortion and the probability that a randomly chosen pair of points embeds with that distortion. If $\mathcal{X}$'s geometry is sufficiently regular, we obtain stronger, bi-Lipschitz guarantees for all points in the dataset. As applications, we derive neural embedding guarantees for datasets from Riemannian manifolds, metric trees, and certain types of combinatorial graphs. When instead embedding into multivariate Gaussian mixtures, we show that probabilistic transformers can compute bi-Hölder embeddings with arbitrarily small distortion.
△ Less
Submitted 18 October, 2022; v1 submitted 14 September, 2022;
originally announced September 2022.
-
Do ReLU Networks Have An Edge When Approximating Compactly-Supported Functions?
Authors:
Anastasis Kratsios,
Behnoosh Zamanlooy
Abstract:
We study the problem of approximating compactly-supported integrable functions while implementing their support set using feedforward neural networks. Our first main result transcribes this "structured" approximation problem into a universality problem. We do this by constructing a refinement of the usual topology on the space $L^1_{\operatorname{loc}}(\mathbb{R}^d,\mathbb{R}^D)$ of locally-integr…
▽ More
We study the problem of approximating compactly-supported integrable functions while implementing their support set using feedforward neural networks. Our first main result transcribes this "structured" approximation problem into a universality problem. We do this by constructing a refinement of the usual topology on the space $L^1_{\operatorname{loc}}(\mathbb{R}^d,\mathbb{R}^D)$ of locally-integrable functions in which compactly-supported functions can only be approximated in $L^1$-norm by functions with matching discretized support. We establish the universality of ReLU feedforward networks with bilinear pooling layers in this refined topology. Consequentially, we find that ReLU feedforward networks with bilinear pooling can approximate compactly supported functions while implementing their discretized support. We derive a quantitative uniform version of our universal approximation theorem on the dense subclass of compactly-supported Lipschitz functions. This quantitative result expresses the depth, width, and the number of bilinear pooling layers required to construct this ReLU network via the target function's regularity, the metric capacity and diameter of its essential support, and the dimensions of the inputs and output spaces. Conversely, we show that polynomial regressors and analytic feedforward networks are not universal in this space.
△ Less
Submitted 1 August, 2022; v1 submitted 24 April, 2022;
originally announced April 2022.
-
Designing Universal Causal Deep Learning Models: The Geometric (Hyper)Transformer
Authors:
Beatrice Acciaio,
Anastasis Kratsios,
Gudmund Pammer
Abstract:
Several problems in stochastic analysis are defined through their geometry, and preserving that geometric structure is essential to generating meaningful predictions. Nevertheless, how to design principled deep learning (DL) models capable of encoding these geometric structures remains largely unknown. We address this open problem by introducing a universal causal geometric DL framework in which t…
▽ More
Several problems in stochastic analysis are defined through their geometry, and preserving that geometric structure is essential to generating meaningful predictions. Nevertheless, how to design principled deep learning (DL) models capable of encoding these geometric structures remains largely unknown. We address this open problem by introducing a universal causal geometric DL framework in which the user specifies a suitable pair of metric spaces $\mathscr{X}$ and $\mathscr{Y}$ and our framework returns a DL model capable of causally approximating any ``regular'' map sending time series in $\mathscr{X}^{\mathbb{Z}}$ to time series in $\mathscr{Y}^{\mathbb{Z}}$ while respecting their forward flow of information throughout time. Suitable geometries on $\mathscr{Y}$ include various (adapted) Wasserstein spaces arising in optimal stop** problems, a variety of statistical manifolds describing the conditional distribution of continuous-time finite state Markov chains, and all Fréchet spaces admitting a Schauder basis, e.g. as in classical finance. Suitable spaces $\mathscr{X}$ are compact subsets of any Euclidean space. Our results all quantitatively express the number of parameters needed for our DL model to achieve a given approximation error as a function of the target map's regularity and the geometric structure both of $\mathscr{X}$ and of $\mathscr{Y}$. Even when omitting any temporal structure, our universal approximation theorems are the first guarantees that Hölder functions, defined between such $\mathscr{X}$ and $\mathscr{Y}$ can be approximated by DL models.
△ Less
Submitted 9 March, 2023; v1 submitted 31 January, 2022;
originally announced January 2022.
-
Universal Approximation Under Constraints is Possible with Transformers
Authors:
Anastasis Kratsios,
Behnoosh Zamanlooy,
Tianlin Liu,
Ivan Dokmanić
Abstract:
Many practical problems need the output of a machine learning model to satisfy a set of constraints, $K$. Nevertheless, there is no known guarantee that classical neural network architectures can exactly encode constraints while simultaneously achieving universality. We provide a quantitative constrained universal approximation theorem which guarantees that for any non-convex compact set $K$ and a…
▽ More
Many practical problems need the output of a machine learning model to satisfy a set of constraints, $K$. Nevertheless, there is no known guarantee that classical neural network architectures can exactly encode constraints while simultaneously achieving universality. We provide a quantitative constrained universal approximation theorem which guarantees that for any non-convex compact set $K$ and any continuous function $f:\mathbb{R}^n\rightarrow K$, there is a probabilistic transformer $\hat{F}$ whose randomized outputs all lie in $K$ and whose expected output uniformly approximates $f$. Our second main result is a "deep neural version" of Berge's Maximum Theorem (1963). The result guarantees that given an objective function $L$, a constraint set $K$, and a family of soft constraint sets, there is a probabilistic transformer $\hat{F}$ that approximately minimizes $L$ and whose outputs belong to $K$; moreover, $\hat{F}$ approximately satisfies the soft constraints. Our results imply the first universal approximation theorem for classical transformers with exact convex constraint satisfaction. They also yield that a chart-free universal approximation theorem for Riemannian manifold-valued functions subject to suitable geodesically convex constraints.
△ Less
Submitted 8 February, 2022; v1 submitted 7 October, 2021;
originally announced October 2021.
-
Universal Regular Conditional Distributions
Authors:
Anastasis Kratsios
Abstract:
We introduce a deep learning model that can universally approximate regular conditional distributions (RCDs). The proposed model operates in three phases: first, it linearizes inputs from a given metric space $\mathcal{X}$ to $\mathbb{R}^d$ via a feature map, then a deep feedforward neural network processes these linearized features, and then the network's outputs are then transformed to the $1$-W…
▽ More
We introduce a deep learning model that can universally approximate regular conditional distributions (RCDs). The proposed model operates in three phases: first, it linearizes inputs from a given metric space $\mathcal{X}$ to $\mathbb{R}^d$ via a feature map, then a deep feedforward neural network processes these linearized features, and then the network's outputs are then transformed to the $1$-Wasserstein space $\mathcal{P}_1(\mathbb{R}^D)$ via a probabilistic extension of the attention mechanism of Bahdanau et al.\ (2014). Our model, called the \textit{probabilistic transformer (PT)}, can approximate any continuous function from $\mathbb{R}^d $ to $\mathcal{P}_1(\mathbb{R}^D)$ uniformly on compact sets, quantitatively. We identify two ways in which the PT avoids the curse of dimensionality when approximating $\mathcal{P}_1(\mathbb{R}^D)$-valued functions. The first strategy builds functions in $C(\mathbb{R}^d,\mathcal{P}_1(\mathbb{R}^D))$ which can be efficiently approximated by a PT, uniformly on any given compact subset of $\mathbb{R}^d$. In the second approach, given any function $f$ in $C(\mathbb{R}^d,\mathcal{P}_1(\mathbb{R}^D))$, we build compact subsets of $\mathbb{R}^d$ whereon $f$ can be efficiently approximated by a PT.
△ Less
Submitted 23 February, 2023; v1 submitted 17 May, 2021;
originally announced May 2021.
-
Universal Approximation Theorems for Differentiable Geometric Deep Learning
Authors:
Anastasis Kratsios,
Leonie Papon
Abstract:
This paper addresses the growing need to process non-Euclidean data, by introducing a geometric deep learning (GDL) framework for building universal feedforward-type models compatible with differentiable manifold geometries. We show that our GDL models can approximate any continuous target function uniformly on compact sets of a controlled maximum diameter. We obtain curvature-dependent lower-boun…
▽ More
This paper addresses the growing need to process non-Euclidean data, by introducing a geometric deep learning (GDL) framework for building universal feedforward-type models compatible with differentiable manifold geometries. We show that our GDL models can approximate any continuous target function uniformly on compact sets of a controlled maximum diameter. We obtain curvature-dependent lower-bounds on this maximum diameter and upper-bounds on the depth of our approximating GDL models. Conversely, we find that there is always a continuous function between any two non-degenerate compact manifolds that any "locally-defined" GDL model cannot uniformly approximate. Our last main result identifies data-dependent conditions guaranteeing that the GDL model implementing our approximation breaks "the curse of dimensionality." We find that any "real-world" (i.e. finite) dataset always satisfies our condition and, conversely, any dataset satisfies our requirement if the target function is smooth. As applications, we confirm the universal approximation capabilities of the following GDL models: Ganea et al. (2018)'s hyperbolic feedforward networks, the architecture implementing Krishnan et al. (2015)'s deep Kalman-Filter, and deep softmax classifiers. We build universal extensions/variants of: the SPD-matrix regressor of Meyer et al. (2011), and Fletcher (2003)'s Procrustean regressor. In the Euclidean setting, our results imply a quantitative version of Kidger and Lyons (2020)'s approximation theorem and a data-dependent version of Yarotsky and Zhevnerchuk (2019)'s uncursed approximation rates.
△ Less
Submitted 25 July, 2022; v1 submitted 13 January, 2021;
originally announced January 2021.
-
Optimizing Optimizers: Regret-optimal gradient descent algorithms
Authors:
Philippe Casgrain,
Anastasis Kratsios
Abstract:
The need for fast and robust optimization algorithms are of critical importance in all areas of machine learning. This paper treats the task of designing optimization algorithms as an optimal control problem. Using regret as a metric for an algorithm's performance, we study the existence, uniqueness and consistency of regret-optimal algorithms. By providing first-order optimality conditions for th…
▽ More
The need for fast and robust optimization algorithms are of critical importance in all areas of machine learning. This paper treats the task of designing optimization algorithms as an optimal control problem. Using regret as a metric for an algorithm's performance, we study the existence, uniqueness and consistency of regret-optimal algorithms. By providing first-order optimality conditions for the control problem, we show that regret-optimal algorithms must satisfy a specific structure in their dynamics which we show is equivalent to performing dual-preconditioned gradient descent on the value function generated by its regret. Using these optimal dynamics, we provide bounds on their rates of convergence to solutions of convex optimization problems. Though closed-form optimal dynamics cannot be obtained in general, we present fast numerical methods for approximating them, generating optimization algorithms which directly optimize their long-term regret. Lastly, these are benchmarked against commonly used optimization algorithms to demonstrate their effectiveness.
△ Less
Submitted 19 January, 2021; v1 submitted 31 December, 2020;
originally announced January 2021.
-
Learning Sub-Patterns in Piecewise Continuous Functions
Authors:
Anastasis Kratsios,
Behnoosh Zamanlooy
Abstract:
Most stochastic gradient descent algorithms can optimize neural networks that are sub-differentiable in their parameters; however, this implies that the neural network's activation function must exhibit a degree of continuity which limits the neural network model's uniform approximation capacity to continuous functions. This paper focuses on the case where the discontinuities arise from distinct s…
▽ More
Most stochastic gradient descent algorithms can optimize neural networks that are sub-differentiable in their parameters; however, this implies that the neural network's activation function must exhibit a degree of continuity which limits the neural network model's uniform approximation capacity to continuous functions. This paper focuses on the case where the discontinuities arise from distinct sub-patterns, each defined on different parts of the input space. We propose a new discontinuous deep neural network model trainable via a decoupled two-step procedure that avoids passing gradient updates through the network's only and strategically placed, discontinuous unit. We provide approximation guarantees for our architecture in the space of bounded continuous functions and universal approximation guarantees in the space of piecewise continuous functions which we introduced herein. We present a novel semi-supervised two-step training procedure for our discontinuous deep learning model, tailored to its structure, and we provide theoretical support for its effectiveness. The performance of our model and trained with the propose procedure is evaluated experimentally on both real-world financial datasets and synthetic datasets.
△ Less
Submitted 15 December, 2021; v1 submitted 29 October, 2020;
originally announced October 2020.
-
A Canonical Transform for Strengthening the Local $L^p$-Type Universal Approximation Property
Authors:
Anastasis Kratsios,
Behnoosh Zamanlooy
Abstract:
Most $L^p$-type universal approximation theorems guarantee that a given machine learning model class $\mathscr{F}\subseteq C(\mathbb{R}^d,\mathbb{R}^D)$ is dense in $L^p_μ(\mathbb{R}^d,\mathbb{R}^D)$ for any suitable finite Borel measure $μ$ on $\mathbb{R}^d$. Unfortunately, this means that the model's approximation quality can rapidly degenerate outside some compact subset of $\mathbb{R}^d$, as a…
▽ More
Most $L^p$-type universal approximation theorems guarantee that a given machine learning model class $\mathscr{F}\subseteq C(\mathbb{R}^d,\mathbb{R}^D)$ is dense in $L^p_μ(\mathbb{R}^d,\mathbb{R}^D)$ for any suitable finite Borel measure $μ$ on $\mathbb{R}^d$. Unfortunately, this means that the model's approximation quality can rapidly degenerate outside some compact subset of $\mathbb{R}^d$, as any such measure is largely concentrated on some bounded subset of $\mathbb{R}^d$. This paper proposes a generic solution to this approximation theoretic problem by introducing a canonical transformation which "upgrades $\mathscr{F}$'s approximation property" in the following sense. The transformed model class, denoted by $\mathscr{F}\text{-tope}$, is shown to be dense in $L^p_{μ,\text{strict}}(\mathbb{R}^d,\mathbb{R}^D)$ which is a topological space whose elements are locally $p$-integrable functions and whose topology is much finer than usual norm topology on $L^p_μ(\mathbb{R}^d,\mathbb{R}^D)$; here $μ$ is any suitable $σ$-finite Borel measure $μ$ on $\mathbb{R}^d$. Next, we show that if $\mathscr{F}$ is any family of analytic functions then there is always a strict "gap" between $\mathscr{F}\text{-tope}$'s expressibility and that of $\mathscr{F}$, since we find that $\mathscr{F}$ can never dense in $L^p_{μ,\text{strict}}(\mathbb{R}^d,\mathbb{R}^D)$. In the general case, where $\mathscr{F}$ may contain non-analytic functions, we provide an abstract form of these results guaranteeing that there always exists some function space in which $\mathscr{F}\text{-tope}$ is dense but $\mathscr{F}$ is not, while, the converse is never possible. Applications to feedforward networks, convolutional neural networks, and polynomial bases are explored.
△ Less
Submitted 9 June, 2021; v1 submitted 24 June, 2020;
originally announced June 2020.
-
Non-Euclidean Universal Approximation
Authors:
Anastasis Kratsios,
Eugene Bilokopytov
Abstract:
Modifications to a neural network's input and output layers are often required to accommodate the specificities of most practical learning tasks. However, the impact of such changes on architecture's approximation capabilities is largely not understood. We present general conditions describing feature and readout maps that preserve an architecture's ability to approximate any continuous functions…
▽ More
Modifications to a neural network's input and output layers are often required to accommodate the specificities of most practical learning tasks. However, the impact of such changes on architecture's approximation capabilities is largely not understood. We present general conditions describing feature and readout maps that preserve an architecture's ability to approximate any continuous functions uniformly on compacts. As an application, we show that if an architecture is capable of universal approximation, then modifying its final layer to produce binary values creates a new architecture capable of deterministically approximating any classifier. In particular, we obtain guarantees for deep CNNs and deep feed-forward networks. Our results also have consequences within the scope of geometric deep learning. Specifically, when the input and output spaces are Cartan-Hadamard manifolds, we obtain geometrically meaningful feature and readout maps satisfying our criteria. Consequently, commonly used non-Euclidean regression models between spaces of symmetric positive definite matrices are extended to universal DNNs. The same result allows us to show that the hyperbolic feed-forward networks, used for hierarchical learning, are universal. Our result is also used to show that the common practice of randomizing all but the last two layers of a DNN produces a universal family of functions with probability one. We also provide conditions on a DNN's first (resp. last) few layer's connections and activation function which guarantee that these layers can have a width equal to the input (resp. output) space's dimension while not negatively affecting the architecture's approximation capabilities.
△ Less
Submitted 7 November, 2020; v1 submitted 3 June, 2020;
originally announced June 2020.
-
Denise: Deep Robust Principal Component Analysis for Positive Semidefinite Matrices
Authors:
Calypso Herrera,
Florian Krach,
Anastasis Kratsios,
Pierre Ruyssen,
Josef Teichmann
Abstract:
The robust PCA of covariance matrices plays an essential role when isolating key explanatory features. The currently available methods for performing such a low-rank plus sparse decomposition are matrix specific, meaning, those algorithms must re-run for every new matrix. Since these algorithms are computationally expensive, it is preferable to learn and store a function that nearly instantaneousl…
▽ More
The robust PCA of covariance matrices plays an essential role when isolating key explanatory features. The currently available methods for performing such a low-rank plus sparse decomposition are matrix specific, meaning, those algorithms must re-run for every new matrix. Since these algorithms are computationally expensive, it is preferable to learn and store a function that nearly instantaneously performs this decomposition when evaluated. Therefore, we introduce Denise, a deep learning-based algorithm for robust PCA of covariance matrices, or more generally, of symmetric positive semidefinite matrices, which learns precisely such a function. Theoretical guarantees for Denise are provided. These include a novel universal approximation theorem adapted to our geometric deep learning problem and convergence to an optimal solution to the learning problem. Our experiments show that Denise matches state-of-the-art performance in terms of decomposition quality, while being approximately $2000\times$ faster than the state-of-the-art, principal component pursuit (PCP), and $200 \times$ faster than the current speed-optimized method, fast PCP.
△ Less
Submitted 6 June, 2023; v1 submitted 28 April, 2020;
originally announced April 2020.
-
The Universal Approximation Property
Authors:
Anastasis Kratsios
Abstract:
The universal approximation property of various machine learning models is currently only understood on a case-by-case basis, limiting the rapid development of new theoretically justified neural network architectures and blurring our understanding of our current models' potential. This paper works towards overcoming these challenges by presenting a characterization, a representation, a constructio…
▽ More
The universal approximation property of various machine learning models is currently only understood on a case-by-case basis, limiting the rapid development of new theoretically justified neural network architectures and blurring our understanding of our current models' potential. This paper works towards overcoming these challenges by presenting a characterization, a representation, a construction method, and an existence result, each of which applies to any universal approximator on most function spaces of practical interest. Our characterization result is used to describe which activation functions allow the feed-forward architecture to maintain its universal approximation capabilities when multiple constraints are imposed on its final layers and its remaining layers are only sparsely connected. These include a rescaled and shifted Leaky ReLU activation function but not the ReLU activation function. Our construction and representation result is used to exhibit a simple modification of the feed-forward architecture, which can approximate any continuous function with non-pathological growth, uniformly on the entire Euclidean input space. This improves the known capabilities of the feed-forward architecture.
△ Less
Submitted 28 November, 2020; v1 submitted 8 October, 2019;
originally announced October 2019.
-
NEU: A Meta-Algorithm for Universal UAP-Invariant Feature Representation
Authors:
Anastasis Kratsios,
Cody Hyndman
Abstract:
Effective feature representation is key to the predictive performance of any algorithm. This paper introduces a meta-procedure, called Non-Euclidean Upgrading (NEU), which learns feature maps that are expressive enough to embed the universal approximation property (UAP) into most model classes while only outputting feature maps that preserve any model class's UAP. We show that NEU can learn any fe…
▽ More
Effective feature representation is key to the predictive performance of any algorithm. This paper introduces a meta-procedure, called Non-Euclidean Upgrading (NEU), which learns feature maps that are expressive enough to embed the universal approximation property (UAP) into most model classes while only outputting feature maps that preserve any model class's UAP. We show that NEU can learn any feature map with these two properties if that feature map is asymptotically deformable into the identity. We also find that the feature-representations learned by NEU are always submanifolds of the feature space. NEU's properties are derived from a new deep neural model that is universal amongst all orientation-preserving homeomorphisms on the input space. We derive qualitative and quantitative approximation guarantees for this architecture. We quantify the number of parameters required for this new architecture to memorize any set of input-output pairs while simultaneously fixing every point of the input space lying outside some compact set, and we quantify the size of this set as a function of our model's depth. Moreover, we show that no deep feed-forward network with commonly used activation function has all these properties. NEU's performance is evaluated against competing machine learning methods on various regression and dimension reduction tasks both with financial and simulated data.
△ Less
Submitted 10 May, 2021; v1 submitted 31 August, 2018;
originally announced September 2018.